Theme


The last few years have seen rapid advances in a number of new technologies aimed at improving the
storage and retrieval of information. Information scientists may be aware of their potential but may have
reservations concerning the absence of suitable standards or the short life associated with some
developments.


This  Specialists'  Meeting  brought  together  a  group  of  experts  to  explain  the  practicalities  of  applying 
these  new,  powerful  technologies,  their  successes  and  failures.  Topics  addressed  included  Artificial 
Intelligence,  CD  ROM,  Hypertext  and  Local  Area  Networks,  within  the  framework  of  finding  practical 
solutions  to  the  problems  of  achieving  efficient  and  effective  information  transfer.  The  meeting  had  the 
following objectives:

- To present first-hand, practical experience of initiatives to improve information transfer, by experts
  in the field.

- To provide up-to-date information on the new technologies which have been successfully used to
  promote efficient and effective information transfer.


The meeting was directed particularly at information providers and users, especially those acting within
the NATO Community.


Theme


Recent years have been marked by the rapid growth of a number of new technologies aimed at
improving the storage and retrieval of information. Although information specialists are aware of the
potential of these technologies, they sometimes express reservations about the absence of adequate
standards in this field and about the relatively short life of some of these developments.

This Specialists' Meeting brought together a group of experts charged with examining the practical
details of implementing these powerful new technologies and with taking stock of the successes and
failures. The topics addressed were Artificial Intelligence, Hypertext, CD-ROM and Local Area Networks
(LAN). These topics were examined with a view to finding practical solutions to the problems of
effective and efficient data transmission. The objectives of the meeting were:

- to present practical lessons drawn from initiatives taken by specialists in the field of data
  transmission;

- to provide up-to-date information on the new technologies that have been used successfully for
  the effective and efficient transmission of data.


The meeting was therefore addressed in particular to information providers and users, in the context
of exchanges between NATO member countries.
TECHNICAL EVALUATION REPORT


Walter  Blados 

Scientific  and  Technical  Information  Program 
NASA  Headquarters  (Code  JTT) 
Washington  DC  20546 
USA 


SUMMARY 

This  Technical  Evaluation  Report  is  in  two  sections. 
Section  1  provides  a  brief  summary  of  each  of  the  papers 
as  presented.  Section  2  comprises  comments, 
conclusions,  and  recommendations  partly  arising  as  a 
consequence  of  the  meeting. 


SECTION  1 

SUMMARY  OF  PAPERS  AS  PRESENTED 
KEYNOTE  ADDRESS 

The author provided a brief overview of the topics that
were to be presented in subsequent papers. He described
expert systems which are used for information retrieval,
but pointed out that most of these applications have not
reached widespread commercial use, and are still in the
research and development stage.

The next topic discussed was the use of CD-ROM in
information retrieval, including its advantages and
shortcomings. Information stored on CD-ROM has a
growing importance in the database market, and the
number of databases on CD-ROM will increase
dramatically. CD-ROM is increasingly used for parts
catalogs for equipment and machines, as well as for
maintenance manuals. Another growing use of CD-ROM
is as a storage medium for multimedia databases, for
storing text, images, graphics and sound.

Hypertext systems allow the users of information retrieval
systems to identify the relationships among the logical
records of a database, and to display the related
information on a microcomputer monitor.

Non-Boolean search strategies were discussed, including
the document vector criterion, cluster analysis, the method
of fuzzy sets, probabilistic retrieval, and search by means
of the "nearest neighbor". These techniques are still in
developmental stages, but the number of applications is
clearly growing.

Lastly, the author discussed Local Area Networks (LAN),
which are communication facilities which link devices in
a small area. LANs optimize the concept of online
communications by sharing expensive hardware, sharing
software and sharing communication modems.

PAPER  1  -  LOCAL  AREA  NETWORKS  (LAN) 

This paper discussed the man-machine interface problems
encountered during installation, implementation and
post-implementation of the LAN installed at the United
Kingdom's Defence Research Information Centre (DRIC)
in Glasgow. The LAN was considered necessary to
increase the computing capability at DRIC, and to
enhance and automate the maintenance of card index
systems, the maintenance of records relating to
distributions and requests, and the previously manual
preparation of despatch notes, receipts, address labels and
information relating to distribution and special
restrictions. During this process, the problem was not the
technology but the human aspect, the man-machine
interface. The majority of the problems encountered were
easily rectified by software amendments. Among the
man-machine problems encountered were that terminals
were wrongly situated, that previously unidentified tasks
and functions were discovered, and that users complained
about poor response times.


PAPER 2 - GATEWAYS "INTELLIGENTS"

This paper discussed the advantages of a gateway,
namely: only one contract necessary to access several
hosts; only one automatic procedure to the gateway; no
confusion between several similar languages; possibility
of multibase and multihost queries; adapted dialogue (user
profile); choice of the more relevant bases for a query;
user-friendly interface; and efficiency.
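
The "one procedure, several hosts" idea can be illustrated with a minimal Python sketch. The host names, the two command syntaxes and the `Gateway` class are hypothetical, invented purely for illustration; the paper as summarized here does not describe an implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Host:
    """A hypothetical online host with its own query language."""
    name: str
    translate: Callable[[List[str]], str]  # user terms -> host-specific query string
    search: Callable[[str], List[str]]     # host-specific query -> matching titles


class Gateway:
    """Single entry point that hides per-host query languages from the user."""

    def __init__(self, hosts: List[Host]) -> None:
        self.hosts = hosts

    def query(self, terms: List[str]) -> Dict[str, List[str]]:
        # One user request is translated and sent to every host (multihost query).
        results = {}
        for host in self.hosts:
            results[host.name] = host.search(host.translate(terms))
        return results


def make_host(name, joiner, prefix, documents):
    """Build an invented host whose command language is `prefix` + terms joined by `joiner`."""
    def translate(terms):
        return prefix + joiner.join(terms)

    def search(query):
        wanted = query[len(prefix):].split(joiner)
        return [d for d in documents if all(t.lower() in d.lower() for t in wanted)]

    return Host(name, translate, search)


# Two invented hosts whose command languages differ only in syntax.
host_a = make_host("HOST-A", " AND ", "FIND ", ["Hypertext in libraries", "CD-ROM catalogs"])
host_b = make_host("HOST-B", " * ", "S ", ["Hypertext retrieval systems", "LAN installation"])

gateway = Gateway([host_a, host_b])
hits = gateway.query(["hypertext"])
print(hits)
```

The user supplies search terms once; the translation into each host's syntax happens inside the gateway, which is the sense in which confusion between similar command languages is avoided.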

The interrogation of a database in natural language is
considered very interesting; the paper discussed natural
language query and linguistic processing, and
monolingual reformulation. The author states that
multilingual access is necessary because in many cases
a user needs documents that may be in two or more
languages. Multilingual access is useful to provide access
to a database in a language the user does not know (in
this case the system makes a computer-aided decision for
a manual or automatic translation), in a language the user
can read even if he is unable to make a query in it
efficiently, or in which documents in different languages
are mixed.


PAPER 3 - USE OF EXPERT SYSTEMS AS USER
INTERFACE  IN  INFORMATION  RETRIEVAL 

The  author  provided  a  discourse  on  the  difference 
between  user-friendly  interfaces,  which  essentially  provide 
responses  to  the  queries  of  the  retriever,  and  intelligent 
interfaces  that  translate  the  retrieval  query  into  in-depth 
comprehension. 

Intelligent  interfaces  are  based  on  a  refined  representation 
of  domain  knowledge  (the  database  is,  or  is  completed 
by,  a  knowledge  base).  In  order  to  permit  a  better  return 
of  hits  in  response  to  a  query,  a  more  in-depth  analysis  of 
the  query  and  documents  is  necessary. 

Multi-expert  systems  incorporate  more  and  more  complex 
hybrid  systems,  which  essentially  means  the  integration 
of  single  interfaces  into  a  multi-architectured  system. 
The  expert  systems  are  replaced  by  multi-expert  systems 
formed  by  specialist  modules  and  strategy  modules. 

The  interfaces  which  permit  the  use  of  natural  language 
to  interrogate  a  database  are  considered  the  first  stage  in 
the  steps  to  multi-expert  systems.  The  second  stage  is  the 
development of intelligent interfaces formed by a data
base  joined  to  a  knowledge  base. 

The application of techniques for elaborating knowledge
bases from the contents of the documents, and the
preparation of paraphrases by means of lexical or
linguistic transformations, are necessary to prepare for
multi-expert systems.


PAPER 4 - SEARCH STRATEGIES IN NATURAL
LANGUAGE 

The author discussed online searching problems, such as:
what  relevant  databases  exist;  how  do  you  access  them 
(autodial,  gateway);  how  do  you  retrieve  information 
from  them  (terminology,  search  strategy);  and  what  can 
you  do  with  the  retrieved  information  (post-processing)? 

The author then described methods for making online
searching easy for end-users, with emphasis given to
parsing and natural language interfaces to the databases
(Structured Query Language, INTELLECT, SAPHIR,
GURU, SCISOR). Natural language interfaces to
multidisciplinary bibliographic databases include CITE,
OKAPI, ALEXIS/DIANEGUIDE, PLEXUS/TOME/MITI,
and DGIS/STINET.


The author continued with a discussion of vocabulary
control and thesaurus aids, emphasizing the bilingual
NATO Thesaurus. The author concluded with an
explanation of the Netherlands search strategy, called
'Intelligent Information Retrieval', which appears to
provide very good recall.


PAPER 5 - NON BOOLEAN SEARCH METHODS IN
INFORMATION  RETRIEVAL 

All  search  strategies  are  based  on  a  comparison  between 
the  query  and  stored  documents.  At  times,  this 
comparison  is  indirect  (when  the  query  is  compared  with 
clusters)  or  direct  (when  the  query  is  compared  with 
documents  within  a  context  of  a  given  document). 
Often the comparison is iterative in that the user
provides feedback after a first comparison, which will
affect the next comparison.

Search  strategies  consist  of  Boolean  search,  matching 
functions,  and  serial  search.  The  basic  instrument  for 
trying  to  separate  the  relevant  from  the  non-relevant 
documents is a matching function. Cluster-based retrieval
is  based  on  the  hypothesis  that  closely  associated 
documents  tend  to  be  relevant  to  the  same  requests. 
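
As an illustration of a matching function, the following Python sketch ranks documents by the cosine of the angle between term-frequency vectors. The choice of the cosine measure, the function names and the three sample documents are assumptions made for this example; the paper as summarized here does not specify a particular function.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine_match(query_vec, doc_vec):
    """Matching function: cosine of the angle between two term vectors."""
    shared = set(query_vec) & set(doc_vec)
    dot = sum(query_vec[t] * doc_vec[t] for t in shared)
    norm_q = math.sqrt(sum(v * v for v in query_vec.values()))
    norm_d = math.sqrt(sum(v * v for v in doc_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def rank(query, documents):
    """Return (score, document) pairs, best match first, rather than a yes/no Boolean answer."""
    q = term_vector(query)
    scored = [(cosine_match(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Invented sample documents.
documents = [
    "data compression for text storage",
    "local area networks link devices",
    "text retrieval with fuzzy sets",
]
ranking = rank("text compression", documents)
for score, doc in ranking:
    print(f"{score:.3f}  {doc}")
```

Unlike a Boolean search, which would either accept or reject each document, the matching function assigns a graded score, so a document sharing only one query term still appears, lower in the ranking.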

Feedback  and  evaluation  are  necessary  to  improve 
performance  of  a  system  by  taking  account  of  past 
performance.  Basic  evaluation  measures  of  search  and 
retrieval are efficiency versus effectiveness, where
efficiency  is  measured  using  speed  and  storage  overhead, 
and  effectiveness  is  measured  using  relevance. 
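
Relevance-based effectiveness is conventionally expressed as precision and recall. The summary above does not name these measures, so their use here is an assumption; the document identifiers are invented. A minimal Python sketch:

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four documents retrieved, three judged relevant by the user.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The user's relevance judgments that feed such a calculation are the same judgments that an iterative search can use as feedback for the next comparison.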


PAPER  6  -  HYPERTEXT  AND  HYPERMEDIA 
SYSTEMS  IN  INFORMATION  RETRIEVAL 

Current definitions were cited, including:

document - recorded information structured for
human consumption;

hypertext - a system of computer-supported,
non-sequential information processing;

hypermedia - multimedia dynamic links among units
of information;

multimedia - multiple forms of information; and

hyperprogramming - the process of creating
hypertext or hypermedia applications.

Traditional information access methods are fundamentally
linear, that is, a unit of information is read or viewed
from beginning to end, with the document designed to be
accessed with a clear path through the information from
beginning to end. On the other hand, hypertext systems
may provide the user with an initial linear access method,
but at any given location in the information, the user has
the option of selecting one to many further references.
With such hypertext systems, the end user can pursue
data references by following a self-selected trail or
combination of trails through the data.
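
The contrast between linear reading and a self-selected trail can be sketched in Python by treating a hypertext as a directed graph of nodes and links. The node names and link structure are hypothetical, chosen only to illustrate the idea.

```python
from collections import deque

# Hypothetical hypertext: each node lists the nodes it links to.
links = {
    "overview": ["cd-rom", "hypertext"],
    "cd-rom": ["multimedia"],
    "hypertext": ["hypermedia", "overview"],
    "hypermedia": ["multimedia"],
    "multimedia": [],
}


def follow_trail(start, choices):
    """Follow a self-selected trail: at each node the reader picks one outgoing link."""
    trail = [start]
    node = start
    for choice in choices:
        if choice not in links[node]:
            raise ValueError(f"no link from {node!r} to {choice!r}")
        node = choice
        trail.append(node)
    return trail


def reachable(start):
    """All nodes that some trail starting at `start` can visit (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in links[node]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


trail = follow_trail("overview", ["hypertext", "hypermedia", "multimedia"])
print(" -> ".join(trail))
print(sorted(reachable("cd-rom")))
```

A linear document corresponds to the special case in which every node has exactly one outgoing link; the choice among several links at each node is what makes the trail self-selected.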

Hypermedia  and  related  technology  can  improve  both 
formal  and  informal  information  transfer  despite  barriers 
of  distance  and  time  (asynchronous  annotation  of 
information  nodes)  or  language  (computer-aided  systems 
such  as  SYSTRAN). 

The  main  added-value  of  hypermedia  systems  to  the  STI 
community  lies  in  the  ability  of  hypermedia  to  handle  the 
full spectrum of STI's pragmatic content, from data
manipulation  to  video  display. 

Several case studies on hypermedia development were
presented, including Experiment Documentation
Information System (EDIS), Life Sciences Interactive
Information Recall (LSIIR), Decision Support System
Shell, Knowledge Base Browser (KBB), the Space Station
Freedom User Interface Language, PROJECT
EMPEROR-1, and Clinical Practice Library of Medicine
(CPLM).


PAPER 7 - AUTOMATED INPUT INTO
DATABASES: OCR AND DESCRIPTIVE
CATALOGING

Currently,  paper  still  remains  the  most  important  medium 
of  information  exchange.  The  input  process  of  this  paper 
into  a  bibliographic  database  is  a  critical  operation:  it 
comprises  more  than  70%  of  all  the  costs  of  document 
installation. 

Hardware and software for automated input into databases
are available; the transformation process from written
material into a machine-readable database format consists
of scanning at the graphical level, optical character
recognition at the character level, descriptive cataloging
at the document structure level, and subject indexing at
the content level.

AUTOCAT was developed to produce records for a
bibliographic database; its prototype application
environment is INIS (the International Nuclear
Information System of the International Atomic Energy
Agency, based in Vienna). AUTOCAT recognizes
information elements in the machine-readable journals,
normalizes these information elements, and enters them
into the target records, with both steps carried out as
stipulated by INIS rules.


PAPER  8  -  DATA  COMPRESSION  TECHNIQUES 

Data compression is becoming accepted as a means to
reduce the volume of documents, to make better use of
available resources like communication channels and disk
storage.

Data  compression  is  essentially  a  matter  of  modelling  the 
source  of  the  data,  and  is  sometimes  referred  to  as 
"source  coding". 

Data  compression  algorithms  are  divided  into  reversible 
algorithms  which  only  change  the  representation  of  the 
data  into  a  more  efficient  one,  and  non-reversible 
algorithms  which  make  only  an  approximate 
representation  of  the  original  data. 
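
The two classes of algorithm can be contrasted in a short Python sketch. The standard-library `zlib` module stands in for a reversible algorithm, and simple uniform quantization stands in for a non-reversible one; both are assumptions chosen for illustration, since the paper as summarized here does not name specific algorithms, and the sample values are invented.

```python
import zlib

text = ("information retrieval " * 50).encode("ascii")

# Reversible (lossless): only the representation changes,
# and the original bytes are recovered exactly.
packed = zlib.compress(text)
restored = zlib.decompress(packed)
print(len(text), "bytes ->", len(packed), "bytes; exact:", restored == text)

# Non-reversible (lossy): keep only a coarse approximation of numeric samples.
samples = [0.12, 0.48, 0.51, 0.97]
step = 0.25
quantized = [round(s / step) for s in samples]   # small integers, cheaper to store
approximated = [q * step for q in quantized]     # close to, but not equal to, samples
print(quantized, approximated)
```

Repetitive text compresses well because the source is easy to model, which is the sense in which compression is a matter of modelling the source of the data.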

Currently,  data  compression  techniques  for  text 
compression  are  readily  available  on  most  workstations 
and  personal  computers.  A  large  share  of  the  algorithms 
are  in  the  public  domain  and  can  be  freely  used. 


PAPER 9 - COMPUTERIZED PROPERTY DATA
FOR  ENGINEERING  MATERIALS  -  AN 
OVERVIEW 

The  Material  Properties  Database  (MPD)  is  a  collection 
of  data  items  whose  values  correspond  to  various  large 
scale properties, parameters or attributes of materials, and
are  critically  evaluated  or  validated  by  experts  prior  to 
their  being  included  in  the  database. 

The genesis and development of the MPD required a vast
effort, because the data is difficult to deal with. For
example, engineering properties such as the creep strength
of aluminum are not intrinsic properties; rather, they will
change as the material is loaded, as it ages, etc. Also,
a particular material may be known by various
nomenclatures in various countries or communities.
Hence, entries for these properties must include data
about the data, that is, a set of data descriptors and other
associated information that characterizes the individual
data values.
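
The idea of storing "data about the data" with each value can be sketched as a record type in Python. The field names, the material "ALLOY-X", its alias and the numeric value are hypothetical placeholders, not real material data and not the MPD's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PropertyValue:
    """One data value together with the descriptors needed to interpret it."""
    material: str
    aliases: List[str]              # other nomenclatures for the same material
    property_name: str
    value: float
    unit: str
    conditions: Dict[str, str] = field(default_factory=dict)  # e.g. load, temperature, age
    evaluated_by_expert: bool = False


def find_by_any_name(records, name):
    """Look a material up under its primary name or any alternative nomenclature."""
    wanted = name.lower()
    return [
        r for r in records
        if wanted == r.material.lower() or wanted in (a.lower() for a in r.aliases)
    ]


# Placeholder record: the numbers are illustrative only.
records = [
    PropertyValue(
        material="ALLOY-X",
        aliases=["Legierung-X"],
        property_name="creep strength",
        value=100.0,
        unit="MPa",
        conditions={"temperature": "200 C", "duration": "1000 h"},
        evaluated_by_expert=True,
    )
]
matches = find_by_any_name(records, "legierung-x")
print(matches[0].property_name, matches[0].value, matches[0].unit, matches[0].conditions)
```

Because a non-intrinsic property depends on test conditions, the same material and property may legitimately appear in several records that differ only in their `conditions` descriptors.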

Because of the cost of obtaining and disseminating
materials information, and because it is so vital to the
manufacturing industry, materials information must be
regarded as an international commodity. Hence,
standards have been and are being developed to address
the quality and reliability of data, database system
management, system capabilities, and data security and
integrity.


PAPER  10  -  FACILITATING  THE  TRANSFER  OF 
SCIENTIFIC AND TECHNICAL INFORMATION
WITH SCIENTIFIC AND TECHNICAL NUMERIC
DATABASES 

Numeric databases are collections of information and
data, and contain both data and metadata, that is, textual
information relating to the data. Scientific, technical and
engineering databases comprise the second-largest subject
category of all numeric databases, after business
databases.

In  an  effort  to  better  serve  the  scientist  and  engineer,  the 
U.S.  Department  of  Defense  Defense  Technical 
Information  Center  (DTIC),  through  its  Defense  Gateway 
Information  System  (DGIS),  provides  the  end  user  with 
an  access  mechanism  to  databases,  and  through  its  Multi- 
Type  Information  and  Data  Analysis  System  (MIDAS) 
will  provide  a  capability  for  the  end  user  to  process 
bibliographic  information  and  numeric  data. 

DTIC  conducted  an  S&T  Numeric  Database  Technology 
Assessment  to  more  thoroughly  understand  the 
information  and  data  resource  needs  of  the  scientist  and 
engineer,  as  well  as  the  computing  environment  in  which 
they  function  and  operate.  DTIC  has  also  identified 
scientific  and  technical  numeric  databases  throughout 
government and industry. The information gleaned
during this assessment will determine the extent of the
investment  DTIC  will  make  in  providing  expanded 
services  to  their  users. 


SECTION  2 

COMMENTS, CONCLUSIONS AND
RECOMMENDATIONS 

Rather than dwell on an evaluation of each presentation,
I will just make some generalizations. The Keynote
Address  served  as  a  condensation  of  the  status  of  the 
current,  state-of-the-art  techniques  and  technologies  that 
were  to  be  presented  at  the  meeting.  It  was  an  excellent 
introduction  to  the  meeting. 

The subsequent papers presented reports of experiences,
current assessments of developing technologies, and
assessments and evaluations of current systems, software
and hardware. Each topic was well understood by its
author. Time constraints at times prevented in-depth
coverage. However, the question and answer periods
accommodated pressing questions and problems. The
speakers were well prepared and obviously experienced
in their areas of specialisation. The technology discussed
was relevant to the interests and objectives of the TIP
Specialists' Meeting. However, the papers were not
totally in concert with the stated theme; for example, the
following omissions were noted:

"...absence of suitable standards..."
not often mentioned except in passing.

"...short life associated with..."
not discussed at all.

"...successes and failures..."
successes were highlighted; few failures were
mentioned.

"...finding practical solutions..."
not many papers did.

The  past  several  TIP  meetings  have  dwelt  on  technology, 
existent  and  emerging,  that  will  have  an  impact  on  how 
information  resources  will  be  allocated  and  managed. 
This is all well and good, but is this new technology
being developed with the user, both the intermediary
and the end user, in mind?

It  is  well  to  be  aware  of  current  developments.  However, 
I  feel  that  we  must  every  so  often  come  back  to  basics 
and look at what we have done, where we are, and what
will  be  our  future;  keeping  in  mind  that  all  developments 
must  be  in  concert  with  our  users  and  their  requirements. 

The TIP Terms of Reference state that the Panel is
concerned  with  all  aspects  of  the  management  of 
scientific  and  technical  information  as  an  integral  part  of 
the  aerospace  (and  defense)  research  and  development 
process.  Timely,  accurate  and  relevant  STI  is  critical  to 
the  R&D  process;  it  is  an  incredibly  valuable  resource 
that  directly  affects  the  cost  of  performing  a  technical 
task,  the  quality  of  the  results,  and  productivity. 

Unfortunately,  during  the  past  few  years,  STI  program 
managers  have  been  battling  budget  cuts,  coping  with 
personnel  cuts  and  losses,  acquiring  new  equipment,  etc. 
The  relationships  between  R&D  managers  and  STI 
managers  have  loosened;  more  and  more  they  work  as 
separate  communities,  with  the  STI  community  serving  a 
passive  role  by  responding  to  service  requests  of  the 
R&D  community. 

STI  Programs  must  refocus  and  concentrate  on  how  better 
to  support  the  R&D  community,  as  well  as  how  to 
support  scientific  and  technical  productivity.  This 
apparent  gap  between  R&D  managers  and  the  STI 
managers  must  be  filled;  information  specialists  must  be 
actively  involved  in  all  stages  of  R&D.  This  participation 
must not be a passive "Don't call us, we'll call you", but
the  result  of  active  membership  on  the  R&D  team. 

There  are  several  trends  emerging  which  have  a 
significant  impact  on  the  conduct  of  science,  research  and 
development  and  the  corollary  management  activities,  and 
which  dictate  that  generic  issues  of  STI  be  addressed. 
The  trends  include  the  use  of  information  technology,  the 
growth  of  interdisciplinary  research,  and  an  increase  in 
international  collaboration. 

Information technology, which has dramatically changed
the conduct of research, has brought forth a need to better
understand and manage its exploitation. Computerized
instruments gather data many orders of magnitude greater
than previous methods. Telecommunication capabilities
link researchers to computing facilities with vast
capabilities and with data sources not constrained by
geographical location. Data are available, not only in
computerized databases, but also from sensing and other
data gathering instruments. New analytical approaches
are possible through graphics, color enhancement,
animation, and other visualization techniques. With this
ever growing capability, there is a need to help teach
researchers to better use it, to develop better ways to
store and retrieve data and to maintain its integrity, and to
determine how to assure intellectual property rights in an
electronic network.

Many of the significant research challenges today are
interdisciplinary  in  nature,  which  requires  expanding  the 
circle  of  collaborators,  as  well  as  the  range  of  information 
sources.  A  network  of  communications  links  will  soon 
develop  worldwide,  to  link  personal  computers,  work 
stations,  data  bases,  peripherals,  and  information  utilities. 
Information  systems  will  become  transparent,  and  will 
facilitate  the  flow  of  information  and  meaning  among 
people.  Consequently,  we  will  be  able  to  focus  on 
content  not  technology.  Responsive  expert  advice, 
information,  and  solutions  will  be  at  our  fingertips:  we 
will find ourselves receiving more stimulation and
excitement  from  the  systems  than  the  energy  we  put  into 
them.  We  will  become  more  purposeful,  growing,  and 
professional  than  we  are  now. 

Notwithstanding  these  communications  networks  and 
large  databases,  the  different  methodologies,  vocabularies, 
and  cultures  of  individual  disciplines  create  obstacles  to 
efficient  information  exchange.  Systems  need  to  be 
designed  to  accommodate  users  who  were  not 
immediately  involved  in  the  original  research.  Merging 
existing  data  collections  from  different  fields  to  perform 
analyses creates new problems. It becomes extremely
difficult to compare data that were derived using different
techniques or approaches. Contributing to this problem
is the lack of standards for data exchange formats, which
hampers the building of these multidisciplinary databases.
The  bottom  line  is  that  we  must  be  prepared  to  import 
external  information  to  support  the  internal  R&D  process, 
assure  real-time  delivery  of  information  to  support  the 
transfer  and  transition  of  technology  within  the  R&D 
community,  and  be  able  to  export  some  results  to  remain 
competitive  in  the  R&D  arena,  as  well  as  to  provide 
visibility  to  the  organization. 

These problems are further compounded by the growing
internationalization of science. STI is being produced,
enhanced, and stored around the globe. Single countries
in some cases are acknowledged leaders in select
scientific and technical disciplines. Many of the major
research efforts involve worldwide data collection. Not
only are a variety of disciplines involved, but scientists
from around the world are participating in these efforts.

The  users  in  these  projects  are  distant  geographically  as 
well.  Global  economies  dictate  that  every  effort  be  made 
to  reduce  unnecessary  product  and  service  development 
cost.  Communications  networks  facilitate  the  exchange 
of  ideas  and  access  to  remote  databases,  but  there  is  still 
much  progress  that  needs  to  be  made  in  making  systems 
more  transparent  and  in  developing  common  protocols. 

Hence,  the  pace  of  data  collection,  the  growth  of 
international  approaches  to  research,  and  the  tendency  to 
cross traditional disciplinary boundaries all cast a new
perspective  on  earlier  STI  issues,  and  raise  new 
challenges  for  effectively  providing  critical  information  to 
the  end  user. 

The issues to be addressed and resolved are numerous.
They include the transparency of access to vastly
expanded and distributed electronic resources; merging
data from numerous sources; greater data validation;
closer cooperation between the user community of
scientists, engineers, and managers and the information
system designers; the long-term viability of electronic
data; expanded resource commitments to support
technologically advanced information systems; archiving
large scientific databases; what STI should be retained;
where datasets should reside; what formats should be
used; how they can be physically maintained; and how to
reduce dependence on specific hardware and software.

Notwithstanding the above issues, it will also be
necessary to better understand the knowledge transfer
process. It will be necessary to establish a research
agenda to address these and other issues related to STI.
Not only information managers but also policy makers
involved in the science and technology programs need to
understand the relationship of STI to the R&D process,
namely, that knowledge transfer is an inseparable part of
R&D. Innovation is a complex process composed of
multiple and interrelated systems. A better understanding
of knowledge diffusion by policy makers, R&D managers,
scientists, engineers, and information specialists should
result in better defined policy and programs that will
enhance the productivity of the R&D community, and in
turn enhance competitiveness.

As STI concerns move beyond the parochial interest of
particular disciplines, as linkages occur with the
networking community, and as the trends toward
interdisciplinary research on a global scale become more
pervasive, an expanded R&D user community is
developing. The user community must voice legitimate
concerns about both technical and policy issues associated
with STI. The user community must identify common
concerns about STI access and about building systems
that will accommodate the needs of future government
scientific and technical initiatives.


We are in the dynamics of technological pull versus
administrative lag. Administrative lag retards the
development and use of the new information systems and
technologies. The industrial age from which we are
departing needed us to be interchangeable cogs in a
machine, turned-off and emotionless, mechanical, routine,
controllable, and consistent. The new information age
into which we are entering needs us to be growing,
experimental, creative, enthusiastic, risking, and taking
initiative.

CONCLUSIONS/RECOMMENDATIONS 

• STI must be considered as an R&D resource, essential
to the continued success and innovation of the R&D
community. Not to be overlooked is the fact that STI has
costs: costs in collection, internal and external
communications, processing and storage, archiving and
disposal, and in skilled staff used in all of the activities
above. It is also worth noting that although STI
is used mainly by the scientists and engineers of the R&D
community, it does have value and is required at the
policy level, as well as at the managerial level.

• STI management means more than simply developing
more sophisticated information transfer systems; rather,
it means providing the means to exploit both internal
(corporate) and pertinent external (other governmental/
industrial/foreign) information to meet the requirements
of the R&D community.

• Practical steps must be taken to improve the quality,
timeliness, and accuracy of information which will have
an impact on the efforts of the R&D community. By
recognizing problems and taking appropriate action to
correct them, information handling costs can be reduced.
Given the size of expenditure on information handling,
even small improvements in the efficient use of
information can result in very large potential savings.

• Effective use of information adds value to all the
activities of the R&D community. It means improved
quality of information for more effective planning; more
effective and efficient discharge of functions and higher
quality of service; more accurate, more cost-effective
information; reduced expenditure on the collection,
communication and storage of unnecessary data; and a
better focused information system investment.

• However, it is only in close concert with the R&D
community that we can make the most effective use of
information. It is only in concert with the R&D
community that we can identify and specify the needs for
information (including its content, quality, and
timeliness); identify the most appropriate sources of
information to meet these needs; identify the most
appropriate mechanism for the delivery of this
information; and establish procedures to allow data from
many sources to be brought together to provide
information at the point of need. In short, it is only with
the help and cooperation of the R&D community that STI
Programs can provide information services which are
easily accessible, and allow users to find the information
they need with the minimum difficulty and minimum
intervention by skilled specialists.

• The starting point for information management must
be an understanding of the users' business, its aims and
objectives, and how these are translated into the functions
it performs. It is then possible to derive or work out the
total information needed to carry out this mission. It is
important to note that the product which results from
processing the required information matters greatly, and
should employ language familiar to the users of the
information.

• Through dialogue and support of the R&D community
the differences between information need and provision
can be investigated, and this investigation will determine
where it is necessary to make up the deficiencies or
dispose of the surpluses. The choice of delivery systems
depends largely on who needs the information, how
quickly, how frequently, and what they do with it.
Exploitation of the information stock also depends on
knowing what is available, and on being able to identify
whether it offers a contribution to the requirements. The
tasks are all continuous, requiring constant or periodic
review, which is best done during R&D planning stages.

• STI management must become a part of the accepted
culture of the R&D community, but it cannot become so
unless adopted and accepted by it. A start should be
made now to integrate one's STI Program into the R&D
infrastructure, including funding and operational control.
Within the R&D infrastructure, we must obtain
management commitment, review and produce policy
reflecting our organizational status, allocate
responsibilities and set to work on implementing the true
requirements of the R&D community.

In this keynote address we intend to give a brief
overview of almost all the topics which will
be discussed during this Meeting. We will begin
by speaking about the use of expert systems in the
information retrieval field. Expert systems have
been used for information retrieval for many
years, but until now most of these applications
have not reached a wide commercial market and
are limited in most cases to use in a university
or research center.

Expert systems can be used in information
retrieval for the following purposes:

-Selection of a suitable database in which to carry
out a search, from among the databases available
from a set of hosts.

-Formulation of an information request in
natural or controlled language.

-Translation of the information request into a
search strategy and automated searching.

-Display of some search results, extraction
from them of new search terms to be
included in a new search strategy, and iteration
of the search process with this new search
strategy until the recall and precision
requested by the user are reached.

What characteristics must an expert system
have to fulfill these requirements? They are
the following:

a) ability to parse the user's information
requests.

b) ability to control the man-machine
interaction.

c) ability of the program to explain its
capabilities and limitations, what it is doing
and why.

d) ability to identify an object from a
description.

e) ability to learn heuristically (by trial and
error).

f) ability to ascertain the user's knowledge of
the search topic.

g) ability to correct user errors.

h) operation in a user-friendly mode.

i) use of tutorials to guide the user.

j) suitable system response time, and

k) ability to help the user prepare the search
strategy.

To these requirements we can add the following:

1) ability to understand information requests in
natural language

2) ability to weight search terms automatically
and to calculate the global weight of a reference

3) iterative search for references similar to
previously obtained records.

To meet all these requirements the expert
system must operate in the following way: the
expert system asks the user for the concepts
involved in his information need and then asks
him for the search terms. It then explains to the
user the various operators used in the search
process and shows the user the initial search
strategy.

Next the expert system asks the user for the
desired recall and precision, and asks him to place
the search within a broad subject classification so
that the set of databases to be used can be selected.
The search is then carried out and some records are
displayed, so that the user can decide whether to
iterate the search process, to broaden or restrict the
number of records retrieved, or to extract new
search terms from the displayed records.
After this the system iterates the search process,
if necessary, until a good search result is reached.
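
The loop just described can be sketched in a few
lines of Python. This is only an illustration under
stated assumptions, not the design of any particular
expert system: `run_search` and the list of candidate
strategies are hypothetical, and the set of relevant
records is assumed to be known, whereas a real system
can only estimate recall from the user's judgement of
the displayed records.

```python
from typing import Callable, List, Optional, Set, Tuple


def precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of the retrieved records that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for r in retrieved if r in relevant) / len(retrieved)


def recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant records that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def iterate_search(
    strategies: List[str],
    run_search: Callable[[str], List[str]],  # hypothetical search callback
    relevant: Set[str],
    wanted_precision: float,
    wanted_recall: float,
) -> Optional[Tuple[str, List[str]]]:
    """Try successive strategies until both targets requested by the user are met.

    Returns the first satisfactory (strategy, records) pair, or the last
    pair tried if none is satisfactory, or None if no strategy was given.
    """
    result = None
    for strategy in strategies:
        records = run_search(strategy)
        result = (strategy, records)
        if (precision(records, relevant) >= wanted_precision
                and recall(records, relevant) >= wanted_recall):
            break
    return result
```

In this sketch each new strategy stands for one round
of broadening or restricting the search; how the next
strategy is derived from the displayed records is left
out.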

In short, we can say that expert systems in
information retrieval have the following
objectives:

to give information about,

to help the user with,

to carry out on behalf of the user, or

to spare the user the need to know about

one or more of the following processes:

a) preliminary processes (telecommunications and
host selection, use of telecommunications, host
access or database selection)

b) the search process (search strategy planning,
search term selection, iterative search)

c) handling of the records retrieved in the search

d) ancillary processes (error correction, recording
of search data, etc.)

e) translation of information requests in natural
language into a search strategy.
The second topic to be discussed in this
keynote will be the use of CD-ROM in
information retrieval.

The high storage capacity of CD-ROM,
between 550 and 600 Mbytes, makes its use in
automated information retrieval very
attractive, since a bibliographic database with
1,000,000 records can be stored on a small
number of CD-ROMs (between two and six,
depending on the average record size).
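
The figure of two to six discs can be checked with a
short calculation. The sketch below is an illustration
only: the average record sizes of 1000 to 3000 bytes are
assumptions chosen to show the range, one Mbyte is taken
as 1,000,000 bytes, and the space needed for indexes is
ignored.

```python
import math


def discs_needed(records: int, avg_record_bytes: int,
                 disc_capacity_mb: int = 550) -> int:
    """Number of CD-ROMs needed to hold the records (index overhead ignored)."""
    capacity_bytes = disc_capacity_mb * 1_000_000
    return math.ceil(records * avg_record_bytes / capacity_bytes)


# Assumed average record sizes, in bytes.
for size in (1000, 2000, 3000):
    print(size, "bytes per record:", discs_needed(1_000_000, size), "discs")
```

With these assumed sizes the result runs from two discs
(1000 bytes per record) to six discs (3000 bytes per
record) at 550 Mbytes per disc.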




The use of CD-ROM databases has the
following advantages:

-there is no need for telecommunications access
to a main computer in order to search large
databases online, so no PTT or host charges
are incurred
-a search can be extended in time without
incurring any of the above-mentioned costs
-to use a CD-ROM database it is only
necessary to pay an annual subscription, for
which one receives a large portion of a
database or the whole database.

Among the shortcomings are the following:
-the access time is very long in comparison
with online access, and this gives rise to a
slower information retrieval process. If we take
into account the need to consult several
CD-ROMs to carry out a search, a search on
CD-ROM may take 40, 50 or 60 minutes if the
number of file years to be searched is high
enough.

-the delay between the publication of a
document record and its appearance on CD-ROM
is rather long in most cases (around 1 1/2 to
2 months after the record appears in an online
database). This delay is probably acceptable
for a great number of users, but for some it is
unacceptable. In any case the delay is becoming
shorter, and currently many CD-ROMs have
information available only one or two weeks
later than online.
-another shortcoming is the high price of most
databases on CD-ROM, which makes this type of
database storage cost-effective only when a
certain number of searches are done yearly.
Consequently, if we need to acquire many
databases on CD-ROM, the cost can be
very high.
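
The number of yearly searches at which a subscription
pays for itself can be estimated as follows. This is a
minimal sketch with made-up prices: the subscription
price and the online cost per search used in the example
are hypothetical and are not taken from any price list.

```python
import math


def break_even_searches(annual_subscription: float,
                        online_cost_per_search: float) -> int:
    """Smallest number of searches per year at which the CD-ROM
    subscription costs no more than doing the same searches online."""
    if online_cost_per_search <= 0:
        raise ValueError("online cost per search must be positive")
    return math.ceil(annual_subscription / online_cost_per_search)


# Hypothetical prices: subscription of 2000, online search costing 25.
print(break_even_searches(2000, 25))
```

Below this number of searches the online service is the
cheaper option; above it the CD-ROM subscription is.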

We will now give some data to illustrate the
growing use of CD-ROMs in information
retrieval:

-a study carried out in 1987 noted that 19,000
CD-ROM units were put on the market in 1986,
and estimated 50,000 for 1987, about 137,000
for 1988 and about 597,000 for 1989, which
means multiplying sales by 30 in four years.

-in the same study the numbers of databases
on CD-ROM were the following: 5 DB in 1985,
25 DB in 1986, 125 DB in 1987 and about
210 DB in 1988.

-in "The CD-ROM directory", edited by TFPL,
there were 350 DB in 1989 and 715 DB in 1990,
and for 1991 the figure is about 1450 DB, which
means that the number has roughly quadrupled
(1450/350 is about 4.1) over the last
three years.
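
The growth factors implied by the figures quoted above
can be checked with simple arithmetic. The sketch uses
only the numbers given in the text; the variable names
are ours.

```python
def growth_factor(first: float, last: float) -> float:
    """Ratio between the last and the first value of a series."""
    return last / first


# CD-ROM units put on the market, per year (1987 study).
units = {1986: 19_000, 1987: 50_000, 1988: 137_000, 1989: 597_000}
# Databases listed in "The CD-ROM directory" (TFPL), per year.
databases = {1989: 350, 1990: 715, 1991: 1450}

units_factor = growth_factor(units[1986], units[1989])
db_factor = growth_factor(databases[1989], databases[1991])

print(f"units 1986-1989: x{units_factor:.1f}")
print(f"databases 1989-1991: x{db_factor:.1f}")
```

The unit sales grow by a factor of a little over 31, and
the number of databases by a factor of a little over 4.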

From all these data we can conclude that
information stored on CD-ROM has a growing
importance in the database market and that in
the coming years the number of DB on CD-ROM
will increase strongly. On the other hand,
taking into account the decreasing prices of
CD-ROM subscriptions and the increasing
number of databases on CD-ROM, we can
suppose that in the next 8 or 10 years the
use of online databases will be greatly reduced,
these databases being used only when the
CD-ROM counterpart is more expensive than the
online one or, of course, when the database is
not on CD-ROM.

Another growing application of CD-ROM is its
use in the production of parts catalogs for
different equipment and machines, or of
equipment maintenance manuals (e.g. for
commercial aircraft, automobiles, etc.). In these
cases the large capacity of CD-ROM is used to
store graphics and images in digitized form.

Another application of CD-ROM is as a storage
medium for multimedia databases, storing
text, images, graphics and sound; this
application is growing steadily.

Another very interesting feature of CD-ROM
in the field of information storage is the use
of this disk in erasable form, i.e. erasable
CD-ROM. At present the storage medium
of this type which has achieved the widest
commercial diffusion is the magneto-optic disk.
The writing process is as follows: the binary
0 and 1 are stored in magnetic form by means
of a high-power laser beam which heats the
base material to a temperature at which the
orientation of the magnetic particles is easily
changed by means of a weak magnetic
field; afterwards the laser beam is switched off
and the new orientation of the magnetic
particles is "frozen", ending the writing
phase. In the reading process a low-power
laser beam is used; when it is reflected from the
disk surface its polarization is rotated in one
direction if there is a 0 and in the opposite
direction if there is a 1.

The erasing process is carried out by heating
the base material with the high-power laser
beam and applying to the corresponding
magnetic particles a magnetic field opposite to
the one initially applied.

These magneto-optic disks, and others based on
different physical principles, are increasingly
used to store large amounts of data in a small
volume, and we can suppose that they will
replace magnetic disks in the near future, for
instance to store large personal databases.

The use of CD-ROM networks, which allow many
users to access a CD-ROM database
simultaneously, is currently increasing. The use
of these optical networks is linked with the use
of local area networks, and in this way allows
access to the data from different access points.
We can therefore say that "online" access
to CD-ROM databases is beginning, with all the
consequences that this implies.

The next topic to be discussed will be the
use of hypertext in DBMS (Data Base
Management Systems); this development appears
very promising for the near future.

Since 1987 the possibility of using hypertext
systems has been considered, to allow the users
of information retrieval systems to identify the
relationships among the logical records of a
database and to display the related information
on a microcomputer monitor. We can mention
for instance the IR system Rivage, which has
an operating scheme based on hypermedia and
allows the interactive browsing of an image
database stored on videodisc.

Hypertext technology can be used for the
following objectives:

-storage and management of non-linear
documents

-storage and management of a document
database in hypertext

-management of the semantic structure of
concepts (paradata) together with the
management of a network of links, which can
serve for thesaurus management and for the
management of the documents containing those
concepts.

In this framework each hypertext node must
contain a full document or a part of a document,
and the network of links among nodes is used to
connect the different parts of documents
structurally and to connect documents that are
semantically similar.

For the hypertext system to manage databases
it is necessary to define the
following types of nodes and links:

-text nodes (capable of containing text
fragments)

-topic nodes (capable of containing the
semantic description of a concept)

-structural links (allow the structuring of
non-linear documents)

-semantic links (allow the connection of topic
nodes)

-connecting links (manage the relationships
between the concepts appearing in the thesaurus
and the documents of the hypertext database)
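
The node and link types listed above can be modelled
with a small data structure. The following Python
sketch is one possible reading, not the design of any
actual system: the class and field names are ours, and
we assume that structural links join text nodes,
semantic links join topic nodes, and connecting links
run from a topic node to a text node.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class LinkKind(Enum):
    STRUCTURAL = "structural"  # text node -> text node (document structure)
    SEMANTIC = "semantic"      # topic node -> topic node (thesaurus relation)
    CONNECTING = "connecting"  # topic node -> text node (concept to document)


@dataclass
class Link:
    kind: LinkKind
    source: str
    target: str


@dataclass
class HypertextDatabase:
    text_nodes: Dict[str, str] = field(default_factory=dict)   # id -> text fragment
    topic_nodes: Dict[str, str] = field(default_factory=dict)  # id -> concept description
    links: List[Link] = field(default_factory=list)

    def add_link(self, kind: LinkKind, source: str, target: str) -> None:
        # Each kind of link may only join the node types assumed for it.
        source_pool = self.text_nodes if kind is LinkKind.STRUCTURAL else self.topic_nodes
        target_pool = self.topic_nodes if kind is LinkKind.SEMANTIC else self.text_nodes
        if source not in source_pool or target not in target_pool:
            raise ValueError(f"{kind.value} link cannot join {source!r} and {target!r}")
        self.links.append(Link(kind, source, target))

    def documents_for_topic(self, topic_id: str) -> List[str]:
        """Text nodes reached from a topic node through connecting links."""
        return [link.target for link in self.links
                if link.kind is LinkKind.CONNECTING and link.source == topic_id]
```

Keeping the three kinds of link in one list with a
`kind` tag is a simplification; a real system would
probably index them separately.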

In brief: a database managed by an information
retrieval system consists of two components:

-the set of documents or their surrogates and

-the indexing terms (paradata)

and in a hypertext system there are two
components:

-the set of documents or their surrogates and
-a network of links connecting the documents
through a semantic or structural relationship,
which is equivalent to paradata.

Partial matching criteria must be used when
searching for records in a hypertext database,
between the document search models and the
search strategies.

Each topic node (containing a thesaurus term)
is related to the set of documents pertinent
to this term, and the network of links among
topic nodes supports all the relationships among
the terms used in the thesaurus.

The following types of search can be carried
out: search for character strings; non-sequential
browsing of document records following the links
between concepts belonging to the documents; and
search starting from a pertinent record,
previously found, following the web of hypertext
links.
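
The third type of search, starting from a pertinent
record and following the links, can be sketched as a
breadth-first walk. This is an illustration only: the
dictionary of links and the limit on the number of hops
are assumptions of ours.

```python
from collections import deque
from typing import Dict, List


def browse_from(start: str, links: Dict[str, List[str]],
                max_hops: int = 2) -> List[str]:
    """Records reachable from `start` in at most `max_hops` links,
    in order of discovery (the start record itself is not returned)."""
    seen = {start}
    found: List[str] = []
    queue = deque([(start, 0)])
    while queue:
        record, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not follow links beyond the hop limit
        for neighbour in links.get(record, []):
            if neighbour not in seen:
                seen.add(neighbour)
                found.append(neighbour)
                queue.append((neighbour, hops + 1))
    return found
```

Records found after fewer hops come first, which gives a
rough ordering by closeness to the starting record.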

The links between nodes can be as follows:
-links between descriptors and indexes, which
can link descriptors with a thesaurus, a
hierarchical index or a permuted term list
-full text links, to link similar documents and
-citation links, to link two or more documents
cited in another one

A series of hypertext applications for the
management of hypermedia databases (with
records formed by a combination of audio,
video and text) is currently being developed.

An example of the online application of hypertext
to database management is Hyperline, from
ESA-IRS. This facility is an information
browser that allows browsing of concepts and
references and carries out the semantic
association between the concepts searched by the
user and the concepts stored in the information
retrieval system.

Hyperline integrates two basic elements
of information retrieval, document browsing
and navigation through concepts, and bases
the interaction with the computer on browsing
and concept association.

How is Hyperline built?
Documents in bibliographic databases are
indexed and transformed into records, and are
also classified. At the same time as the
classification, a knowledge base is built from
the concepts involved and their mutual
relationships. This knowledge base is loaded
into the computer with the bibliographic records;
it is used and explored by Hyperline, allowing
concept navigation and the reading or browsing
of references at any moment of the navigation.

A new information retrieval model with a
two-level architecture has been developed:
the set of relevant records appears in the first
level and the semantically related concepts are
placed in the second one. The first level is
managed by the I.R. system, and the second one
has been designed as a conceptual interface
between the user and the record set.


The man-computer interface must include the
following functions:
semantic association; forward and backward
concept navigation; sequential and
associative reading of references; history of the
interaction; and support for query formulation.

We discuss each of these functions below:

-Semantic association: the purpose of the
semantic association function is to give the user
an entry point into the concept network stored in
the system. The user writes his query concept in
natural language words, and the system maps it
onto a list of semantically related concepts,
which is part of the knowledge base managed by
the system. In this way the user receives an
answer from the system regardless of the terms
used in the query. The list of semantically
related concepts enables him to begin navigating
through the concepts.

-Navigation: this function offers users the
possibility of browsing the semantic structure
of concepts which represents the information
content of the bibliographic records.

-Reading of references: at any time during
navigation through the data structure, the user
can read the bibliographic references containing
the term or concept being examined in the
semantic network.

-History: the history function keeps details of
the user-system interaction during a Hyperline
session; it displays all the functions executed
during the navigation process.

-Support  for  query  formulation:  At  any  time 
during  concept  browsing,  concepts  can  be 
selected  and  put  aside  for  subsequent  use  in 
Boolean  query  formulation.  This  allows  an 
intimate  interconnection  between  the  classical 
Boolean  searching  and  hypertext  browsing. 

Another topic to be discussed will be
non-Boolean matching criteria.

One of the most serious shortcomings of using
Boolean logic operators as the matching
criterion is the impossibility of ordering the
records obtained as the search result according
to their relevance to the user; in other words,
it is impossible to rank the records, putting
first those in which the search terms are very
important and after them those in which the
search terms are not important. This is due to
the inability to grade the importance of terms
in the retrieved document (or record), since the
assignment of indexing terms to a document is
completely binary (if the term is important it is
assigned, and if it is not important it is not
assigned).


Many other matching criteria have been
developed to avoid this shortcoming; the
most important are the following:

-Document vectors criterion: in this method the
indexed documents are represented by a set of
document vectors; each vector is a set of
concept numbers (codes) with weights, where the
concept numbers represent the indexing terms
assigned to the document and the weights give
the relative importance of each term in the
document. The information request in natural
language is translated in a similar way into a
search vector, and retrieval is carried out by
comparing the search vector with all or some of
the document vectors. The document vectors are
then ranked in descending order of similarity
to the search vector. The search vector is
modified by relevance feedback and the search
is iterated until good results are obtained.
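
A minimal sketch of the document vectors criterion
follows. The text does not say how the search vector
and the document vectors are compared, so the cosine
measure used here is an assumption; the feedback step
is likewise a simple Rocchio-style update chosen for
illustration, with a hypothetical `beta` parameter.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # concept code -> weight


def cosine(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two sparse weighted vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def rank(query: Vector, docs: Dict[str, Vector]) -> List[Tuple[str, float]]:
    """Documents in descending order of similarity to the search vector."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def feedback(query: Vector, relevant: List[Vector], beta: float = 0.5) -> Vector:
    """Move the search vector towards the documents judged relevant."""
    new_query = dict(query)
    for vec in relevant:
        for term, weight in vec.items():
            new_query[term] = new_query.get(term, 0.0) + beta * weight / len(relevant)
    return new_query
```

A search would alternate `rank` and `feedback` until the
user is satisfied with the top of the ranking.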

-Cluster analysis: the document clusters are
prepared by comparing the indexing terms of each
document with those of the others and clustering
those documents whose indexing terms are
similar. For each cluster a representative
element, called the centroid vector, is chosen,
and the search is carried out in two steps: in
the first, the search strategy is compared with
all the centroid vectors, and in the second, the
search strategy is matched against the individual
documents of those clusters whose centroids were
found in the first step to be very similar to the
search strategy. This stepwise search can be
extended to three or more steps by grouping
centroid vectors into broader clusters of greater
coverage. In this case the search begins with the
broadest clusters and then proceeds to
the more specific clusters.
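
The two-step cluster search can be sketched as follows.
Several choices here are assumptions made for the
illustration: the clusters are given in advance, the
centroid is taken as the term-by-term average of the
member vectors, and similarity is measured by the cosine.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]  # indexing term -> weight


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    """Term-by-term average of the member vectors (assumed definition)."""
    total: Vector = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def two_step_search(query: Vector, clusters: Dict[str, Dict[str, Vector]],
                    top_clusters: int = 1) -> List[str]:
    """clusters maps a cluster name to its documents (doc id -> vector)."""
    # Step 1: compare the search strategy with every centroid.
    best = sorted(
        clusters,
        key=lambda name: cosine(query, centroid(list(clusters[name].values()))),
        reverse=True,
    )[:top_clusters]
    # Step 2: match the strategy against the documents of the best clusters only.
    candidates: Dict[str, Vector] = {}
    for name in best:
        candidates.update(clusters[name])
    return sorted(candidates, key=lambda d: cosine(query, candidates[d]), reverse=True)
```

The saving comes from step 2 touching only the documents
of the selected clusters instead of the whole file.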

-Method of fuzzy sets: in this method a fuzzy
set of document identifiers is assigned to each
index term; in this fuzzy set the degree of
membership of each document is given by a
weight between 0.1 and 1, and a document that
does not belong to the set has weight 0. With
fuzzy sets all the Boolean logic operators can
be used, but it is necessary to give up some
axioms of that logic
to obtain consistent search results.
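
The fuzzy set method can be illustrated with the usual
minimum, maximum and complement operators. The text
does not name the operators, so this choice is an
assumption; the example weights are made up.

```python
from typing import Dict, Iterable, List

FuzzySet = Dict[str, float]  # document id -> degree of membership (0 = not a member)


def fuzzy_and(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    return {d: min(a.get(d, 0.0), b.get(d, 0.0)) for d in set(a) | set(b)}


def fuzzy_or(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    return {d: max(a.get(d, 0.0), b.get(d, 0.0)) for d in set(a) | set(b)}


def fuzzy_not(a: FuzzySet, universe: Iterable[str]) -> FuzzySet:
    return {d: 1.0 - a.get(d, 0.0) for d in universe}


def ranked(result: FuzzySet) -> List[str]:
    """Documents with non-zero membership, best first."""
    hits = [d for d, w in result.items() if w > 0]
    return sorted(hits, key=lambda d: result[d], reverse=True)


# Made-up fuzzy sets for two index terms.
laser = {"d1": 0.9, "d2": 0.4}
disk = {"d1": 0.3, "d2": 0.8, "d3": 0.6}
print(ranked(fuzzy_and(laser, disk)))
print(ranked(fuzzy_or(laser, disk)))
```

With these operators "laser AND NOT laser" is not empty
(document d2 keeps a weight of 0.4), which is an example
of a Boolean axiom that no longer holds.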

-Probabilistic retrieval: a probabilistic search
system can begin a search by assigning numeric
values of probability or uncertainty to the
indexing terms, and by using the rules of
probability can obtain the probability that a
document is pertinent to a search topic. These
pertinence probabilities, obtained from the
document indexing terms, govern the information
retrieval decisions of the system. In these
systems the Boolean relationships are lost.

-Search by means of the "nearest neighbor":
in this method a matching is carried out
between the general set of search terms and the
indexing terms of each document, and the
selected documents are ranked in descending
order of similarity to the request. The "nearest
neighbor" is the document whose set of
indexing terms is the most similar to the search
strategy of all those found up to a given moment
of the search. When a document is found which is
more similar to the search strategy than the
current "nearest neighbor", that document becomes
the new "nearest neighbor".
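
The way the "nearest neighbor" is updated during the
search can be sketched as a single pass over the
documents. The similarity measure is not specified in
the text; the Jaccard coefficient used here is an
assumption.

```python
from typing import Dict, Optional, Set, Tuple


def similarity(query: Set[str], doc_terms: Set[str]) -> float:
    """Jaccard coefficient: shared terms divided by all distinct terms."""
    union = query | doc_terms
    return len(query & doc_terms) / len(union) if union else 0.0


def nearest_neighbor(query: Set[str],
                     docs: Dict[str, Set[str]]) -> Tuple[Optional[str], float]:
    """Scan the documents, keeping the most similar one found so far."""
    best_id: Optional[str] = None
    best_score = 0.0
    for doc_id, terms in docs.items():
        score = similarity(query, terms)
        if score > best_score:
            # This document replaces the current nearest neighbor.
            best_id, best_score = doc_id, score
    return best_id, best_score
```

If no document shares a term with the request, the
function returns `(None, 0.0)`.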

-Among the probabilistic retrieval methods
is the weights method, in which weights
are assigned to the search terms in each
document and/or in the search strategy. This
method, used in combination with Boolean
operators, enables us to avoid some shortcomings
of the Boolean method and has achieved some
penetration in the database market.
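
The combination of weights with Boolean operators can
be sketched as a Boolean filter followed by a weighted
ranking. This is one simple reading of the weights
method, not a description of any commercial system: the
scoring rule (sum of strategy weight times document
weight) and the example data are assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def weighted_search(required: Iterable[str],
                    weights: Dict[str, float],
                    docs: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """required: terms that must all be present (Boolean AND).
    weights: search term -> weight in the search strategy.
    docs: document id -> {indexing term: weight in the document}."""
    required = list(required)
    hits = []
    for doc_id, terms in docs.items():
        if all(term in terms for term in required):
            score = sum(w * terms.get(term, 0.0) for term, w in weights.items())
            hits.append((doc_id, score))
    # Unlike a pure Boolean search, the hits can now be ranked.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

The Boolean part decides which records are retrieved;
the weights decide only the order in which they are
shown.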

All of these methods have only a low
level of adoption in the daily routine of
information retrieval; this is due on the one
hand to the low impact of this research
on commercial information retrieval services,
and on the other hand to the limited
knowledge that the staff of the hosts have
of the research in this field
and of its theoretical basis.

Among the advantages of non-Boolean
matching criteria are the following:

-it is not necessary to prepare a Boolean search
strategy; it is enough to present to the
information retrieval system a set of search
terms that we want to appear in the
documents to be retrieved.

-the weighting coefficients are easy to apply.
-the feedback of search terms from retrieved
records is easy.

-it is possible to use a larger number of search
terms than in the Boolean search method.

-a flexible Boolean logic can be used.

-the relationships between search terms and
documents can be expressed.

On the other hand, we can add the
following to the shortcomings of the Boolean
method:

-very good indexing is necessary

-correct use of Boolean logic is necessary,
which is often difficult.

Finally we will speak about the use of
Local Area Networks (LAN) in the library
environment.

We can define a LAN in the following terms:
it is a privately owned communications facility
which links devices in a small area.

We will comment on this definition:

-a LAN must be fully owned and operated
privately by the institution which funds its
activities.

-it is a communications facility that allows
devices to exchange information; several of the
components of a LAN must have independent
intelligence and processing power. LANs can
transmit data, video, images, audio messages
and facsimile.

-a LAN links devices; a device linked to a LAN
must be able to communicate with at least one
other device on the network. The linked devices
can be: CPUs, command and control systems, dumb
or intelligent terminals, fax systems, interactive
video, peripheral devices (tapes, disks, etc.),
telephones and message transceivers.

-it is a facility for a small area (with a
maximum separation between devices of about 2
km)

The  LAN  components  are  the  following:  the 
cabling  system,  workstations,  servers,  interface 
units  and  network  software. 

There are three cabling systems: twisted pair
wire, coaxial cable and optical fiber cable.
Twisted pair wire has one main weakness: it is
susceptible to noise; however, this drawback can
be minimised with proper shielding of the
cable. Its transmission rate is the lowest,
reaching only from 250 Kb/s to 2 Mb/s in
baseband.

Coaxial cable can achieve higher transmission
rates, of the order of some Mb/s without signal
regeneration; it also allows greater distances
than twisted pair wire and a greater number of
attached devices.

The highest performance is attained with
optical fiber cabling, which allows transmission
rates of some Gb/s, is light in weight and is more
noise resistant than the other two cabling
systems. Voice, images, video and data can be
transmitted over this cable.

The  workstations  are  microcomputers  used  to 
access  or  manipulate  data. 

The  servers  are  microcomputers  which  provide 
access  to  shared  resources;  for  each  shared 
device  or  set  of  devices  an  associated  server 
must  be  contacted  before  use.  There  are 
systems  with  independent  servers  and  other 
ones  in  which  the  server  function  is  allocated 
to  some  of  the  workstations. 

The interface units allow the logical
connection between the computing devices.

The LAN software: a LAN needs software to
run and perform all its functions. There are
three types of software: the system software,
which manages the hardware and allows other
software to operate using the host hardware;
the network software, which allows the
interconnection between applications and the
network; and the applications software
(word processing, DBMS, etc.).

Topologies: LANs have three basic
topologies: star, ring and bus/tree. A star is a
group of connected devices (nodes) served by
a central device. A bus/tree is a multiple-access
broadcast medium, the bus being a special case
of a tree with only one trunk. The ring is a
closed bus with each node attached to a
repeating element.

The most common access protocols for LANs
are the following: carrier-sense multiple
access with collision (of messages) detection
(CSMA/CD), carrier-sense multiple access
with collision avoidance (CSMA/CA) and
token passing. In the CSMA/CD protocol a
network node listens to the network to make
sure it is not busy and then begins to transmit;
if a collision is detected, the node stops and
retries after a random delay. In the CSMA/CA
protocol nodes try to avoid collisions before
transmitting, and any collision that still occurs
is detected by the sending and receiving nodes.
Token passing is a deterministic access
scheme: access is controlled by a token
circulated among all nodes at a constant
speed, so each node is guaranteed access
within a set period of time.
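
The deterministic character of token passing can be illustrated with a small simulation. The Python sketch below is not taken from any system discussed at the Meeting; the function name, the rule of one frame per token visit and the slot counting are assumptions made for illustration only.

```python
from collections import deque


def token_ring_schedule(queues, max_rotations=10):
    """Simulate token passing on a ring.

    queues: one list of pending frames per node (illustrative input).
    The token visits the nodes in a fixed order; the holder may send at
    most one frame and then passes the token on (an assumed rule).
    Returns a list of (slot, node, frame) tuples in transmission order.
    """
    pending = [deque(q) for q in queues]
    sent = []
    slot = 0
    for _ in range(max_rotations):
        if not any(pending):          # every queue is empty: stop
            break
        for node, queue in enumerate(pending):
            if queue:                 # only the token holder may transmit
                sent.append((slot, node, queue.popleft()))
            slot += 1                 # the token moves on either way
    return sent


schedule = token_ring_schedule([["a1", "a2"], [], ["c1"]])
for slot, node, frame in schedule:
    print(f"slot {slot}: node {node} sends {frame}")
```

Because the token moves on after every visit, a node with traffic never waits longer than one full rotation of the ring, whereas a CSMA/CD node facing repeated collisions has no such upper bound.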

LANs optimize the concept of online working
in several ways: they make it easy to share
expensive hardware online; software can be
shared online without physically transporting
diskettes from one micro to another; and
communications modems can also be shared.

The  main  reasons  for  networking 
microcomputers  in  Libraries  are  the  following: 
-to access common data (e.g. OPAC,
acquisitions and serials information)

-to share expensive devices (hard disks, high
quality printers, plotters)

-to share software

-to allow electronic mail among departments
and patrons and

-to upload and download from other systems
(data banks).

LANs are also used for the following
reasons:

To improve communications and to share
equipment and software.

LANs can manage relatively large data files
and support a large number of simultaneous
users on a locally controlled system, and also
provide for the efficient use of resources
through shared peripherals, such as printers
and plotters, and shared software.

LANs provide for system data security through
centralized backup and for data integrity
through shared use of a central file; they may
also contribute to improved communications.

Finally we will consider some of the issues to
be analyzed when considering the installation
of a LAN:

-are there sufficient requirements for using a
LAN in your environment?

-is LAN hardware and equipment available to
you from the financial and physical viewpoints?

-will the LAN be reliable?

-is  there  growth  potential  for  the  LAN? 

-is  security  and  control  on  a  LAN  an  important 
issue? 

-how  will  traffic  volume  on  the  LAN  affect 
your  access? 

-what  kind  of  speed  on  the  LAN  will  provide 
the  required  response  time? 

I  hope  to  have  given  a  suitable  global  view  of 
some  of  the  more  interesting  topics  presented 
at  this  Meeting  with  the  above  considerations. 
1. SUMMARY

The discussion is confined to the Man Machine
Interface problems encountered during
implementation and post implementation. Various
development aspects are considered, commencing
with the definition of the users' requirement, as
distinct from the users' wishes, to the provision of
adequate post implementation support.

The  LAN  installed  at  the  Defence  Research 
Information  Centre(DRIC)  during  1987  and 
subsequently  enhanced  in  1988  is  taken  as  the 
model  for  the  discussion. 

The  aspects  considered  are: 

DRIC's working procedures - immediately
prior to installation of the LAN in April 1987.

Phased implementation - of the LAN and
associated consultation procedures.

Phase 1 - Provision of Information
Retrieval facilities for scientific staff.

Phase 2 - Provision of a Document
Movement Control System.

Post implementation review - review of
requirements, problems identified, user
reactions.

Current Activities - endeavours made to
achieve a satisfactory system performance
level, including changes to computer
processing pattern, working procedures, supply
of office furniture.

For each of the aspects listed above details of
problems encountered and solutions implemented
are given.

2. INTRODUCTION
2.1 General

DRIC is the MOD's central deposit and
dissemination point for defence scientific and
technical literature to the UK and overseas
Defence Community. In March 1986 it moved
from St Mary Cray to central Glasgow, where it is
now located along with a number of other MOD
functions in a modern office block. It was formed
in October 1971 by the merger of the defence
component of the Technology Reports Centre
(TRC) with the Naval Scientific and Technical
Information Centre (NSTIC). DRIC is part of the
Assistant Chief Scientific Adviser (Research)
(ACSA(R)) Organisation. Mr M R C Wilkinson is
Head of DRIC and he reports to Director Research
(Technology). In order to discharge its remit
DRIC is organised as follows:
DRIC's total complement is 57.

2.2 Computing facilities

The current configuration is based on a Digital
Equipment Corporation (DEC) VAX 6310 computer
and consists of:

32 Mb of memory
3.2 Gb of magnetic disk
3 Laser printers (desktop)
3 Dot matrix printers
1 Magnetic tape unit

The functions currently supported by the LAN are
depicted in the following diagram:
The computer databases hold over 250,000 records
of documents available from DRIC and dating from
about 1970. The DRIC holding, however, consists
of approximately 600,000 document titles dating
back to around 1940. The databases are currently
accessed via a LAN supporting 40 video terminals,
2 bar code label printers and 18 bar code label
readers. The LAN consists of a mix of Fibre Optic
and Ethernet cables.


3. DISCUSSION
3.1 DRIC - before April 1987

Prior to April 1987 DRIC's databases were held at
an MOD Bureau facility in London. In order to
update these databases, data was extracted from
the relevant documents and entered on computer
input forms. These forms were encoded into
machine readable format by DRIC's own data
preparation staff, on an in-house facility based on
a GEC 4080 computer. The data was then
transferred, via the postal services, to the bureau.

The information retrieval software in use at the
bureau was limited in its capability and lacked
flexibility.

The availability of computing facilities was
restricted to scientific staff involved in information
retrieval services and publication of DRIC's
announcement bulletins. Access was by a single
terminal shared by ten scientists. All output
generated by database searching was printed at
the bureau and was delivered by post to the
scientist concerned. At this point the scientist got
out the scissors and glue pot (cut and paste) to
produce an acceptable result for the requester.
This naturally caused considerable delay to the
supply of information.

Acquisitions, distributions and the supply of
documents to requesters were handled manually.
This entailed:

The maintenance of an enormous card index
system containing document receipt,
movement and bibliographic data.

The maintenance of manual records relating to
distributions and requests.

The manual preparation of despatch notes,
receipts, address labels and information
related to distribution and special restrictions.

At this stage there was no automated procedure
for registering receipt and movement of
documents. Documents arrived in DRIC and their
existence was unknown to the system until they
were entered in the bureau database many weeks
after arrival. These conditions led to a number of
minor security breaches causing the deployment of
scarce resources to investigate.


3.2 Preliminary Study

In 1984/86 it became necessary to consider the
replacement of DRIC's computing facilities.

This was due to:

The age of the in-house computer used for
data capture.

Changing factors at the bureau.

The age and source language of the retrieval
software in use at the bureau.

The impending move to Glasgow.

The first step was to establish the scope of the
requirement before examining the various
hardware and software options available. A small
working group was therefore set up consisting of
representatives from DRIC's Information
Technology (IT) Section - the system providers,
and Publications and Technical Enquiries Sections
- the users. The group also included a
representative from Director General Information
Technology Systems (DGITS). DGITS is the MOD
directorate responsible for all aspects of IT Policy
and Standards, including Technical appraisal and
Procurement.

A representative from the Central Computer and
Telecommunications Agency (CCTA) was also
involved with the working group. CCTA is part of
HM Treasury.

The  group  was  tasked  with: 

Establishing the requirement.

Sizing the requirement in terms of:

Input and Output data volumes.

Fixed data, ie database volumes.

Database searching.

Volume of printed output relating to
document announcement, supply and
distribution.

The number of Video Terminals needed by
the various user sections.

Identifying commercially available software
packages capable of meeting the requirement.

Identifying hardware capable of supporting the
sized requirement and the software package
judged to be the most suitable.

Producing an implementation plan which
would cause the least disruption to DRIC's
customer service.


System Selection

The working group established the system
requirement as consisting of two separate
functions:

Phase 1 - Provision of Information Retrieval
and Document Announcement facilities.

Phase 2 - Provision of a Document Movement
Control System.

This led to the decision that the Information
Retrieval and Document Announcement facilities
should be implemented first. It was anticipated
that this would be reasonably straightforward
since the relevant data and these facilities, in a
limited form, were already available at the bureau.

The sizing exercise indicated that, based on the
assessed volumes of Input and Output data and
projected database size, there would be a need for
1.8Gb of disk storage for data and system files at
the system installation stage. It was also
estimated that this requirement would increase by
another 2Gb within 3 years of installation.

The overall number of Video Terminals required
was estimated at 37 (10 for Phase 1 and 27 for
Phase 2).

The working group identified a number of
commercially available information retrieval
packages. An evaluation of six packages reduced
the number to two which were considered most
suitable for the requirement.

The package eventually chosen was Computer
Aided Information Retrieval System (CAIRS)
supplied by Leatherhead Food Research
Association (LFRA).

It was decided to invite LFRA to tender for the
supply of the software package and suitable
hardware to support the requirement. This
required DRIC to draw up a full Operational
Requirement (OR) for agreement with DGITS and
CCTA. This document is the basis on which a
supplier is invited to tender and sets out all
mandatory and desirable requirements of a
project.

After much discussion and modification the OR
was agreed and in October 1986 LFRA were
invited to tender for the project.

The implementation plan produced by the working
group required that the system be introduced in
two phases:

Phase 1 - Provision of Information Retrieval
facilities for scientific staff.

Phase 2 - Provision of a Document Movement
Control System (DMCS).

Phase 1 would begin with the installation of the
hardware and software and the conversion of
existing data held on the bureau computer.
Phase 2 would be designed and produced in-house
by DRIC's IT section.

The target date for the start of phase 1 was set
for 1 May 1987 while the date for phase 2 was set
for April 1988.

By the end of 1986 LFRA had submitted their
proposal to supply the CAIRS software package,
additional customised conversion and print
software and hardware to support the system.
The hardware proposed was a DEC VAX 8200
computer.

3.4 Phase 1
System Prototype

The databases at the bureau had been designed in
the early 1970s for a computing environment
supported by International Computers Limited
(ICL) and had been modified a number of times
since then. There was therefore a need to convert
the data from ICL to DEC and then to CAIRS
format. This being the case the opportunity was
taken to:

- remove obsolete data elements.

- introduce additional data elements.

- rearrange the order of the data elements.

DRIC, with the assistance of a consultant from
LFRA, implemented a prototype of the Phase 1
system on a PC based version of CAIRS. The
prototype was most useful, enabling both IT and
Scientific staff to trial and modify the system
before full implementation.

Indeed it was at this stage that DRIC made a
fundamental change to the system requirement. It
was decided that Phase 2 - Document Movement
Control System should have its own separate
database.
The system requirement, as originally specified,
proposed:

That the bureau format, of one record for each
document added to the DRIC collection and
keyed on the DRIC accession number, should
continue in use.

That the size of this record be increased to
include data relating to document movements
to meet the Phase 2 requirement.

That the greatest benefit would be gained, at
the earliest possible time, by converting data
from the bureau as follows:

First - records from 1980 to 1987.
Scientific staff would be able to take
advantage of the new system in a
progressive manner, while still using the
bureau facility.

Second - records from 1970 to 1979.

That the two sets of converted data be merged
to form one large database.

At the prototyping stage DRIC was made aware
that in a CAIRS system the size of a database is
limited to 256Mb.

It was apparent that the extra data required for
Phase 2 would cause this limit to be exceeded
even for the 1980 to 1987 data.

Installation of Hardware & Software

The CAIRS software and the DEC VAX 8200 were
delivered in April 1987. About the same time
DEC announced that all 8200s were to be
upgraded to 8250s. The computer delivered to
DRIC was upgraded in October 1987 at no extra
cost.

After successful acceptance trials the task of
converting data from the bureau system began in
May 1987.

At this stage the configuration was based on a
DEC VAX 8200 computer and consisted of:

8Mb of memory
1.8Gb of magnetic disk
2 Laser printers (desktop)
2 Dot matrix printers
1 Magnetic tape unit

Data Conversion

It was determined that records from the bureau
databases would be converted in batches of 200. At
first the conversion process ran fairly smoothly,
until approximately 10,000 records had been
converted and the process time began to increase
considerably. It was projected that the conversion
of all records was now likely to take three to four
times as long as originally estimated.

The conversion process was suspended and
discussions took place with LFRA, who explained
that:

The conversion time is directly related to the
size of the database and in particular the
extent to which a record is indexed.

In DRIC's case almost every word in a record
is indexed, except of course the normal stop
words. This means the index files have many
terms with several thousand postings.

These factors, together with the LFRA
requirement that the index file must always be
in absolutely precise order, which means a
complete file reorganisation at each update,
make the task very great indeed, since on
average there are 60,000 postings per
conversion batch. The end result is that the
inverted file represents some 40% of the size
of the searchable data. Retrieval is to a large
extent governed by the size of the index files
and here CAIRS benefits from its small tidy
files.
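
The effect LFRA described can be sketched in a few lines of Python. This is not the CAIRS implementation, whose internals are not given here; the stop word list, the function names and the in-memory list standing in for the index file are all assumptions chosen to show why a strictly ordered inverted file makes each batch cost more as the database grows.

```python
import bisect

# Illustrative stop words only; the real CAIRS stop list is not known.
STOP_WORDS = {"the", "of", "and", "a", "in", "to"}


def postings_for(record_id, text):
    """Return one (term, record_id) posting per distinct non-stop word."""
    terms = {w for w in text.lower().split() if w not in STOP_WORDS}
    return sorted((term, record_id) for term in terms)


def merge_batch(index, batch):
    """Merge a batch of postings into the ordered index.

    The whole index is rewritten in order (the assumed 'complete file
    reorganisation'), so the work done is proportional to the size of
    the index, not to the size of the batch.
    Returns the new index and the number of entries written.
    """
    merged = sorted(index + batch)
    return merged, len(merged)


def lookup(index, term):
    """Binary-search the ordered index for the records containing term."""
    lo = bisect.bisect_left(index, (term,))
    hi = bisect.bisect_left(index, (term + "\x00",))
    return [record_id for _, record_id in index[lo:hi]]


index, written = [], []
records = {1: "The storage of information",
           2: "Retrieval of information in libraries"}
for record_id, text in records.items():
    index, n = merge_batch(index, postings_for(record_id, text))
    written.append(n)   # grows with every batch although batches stay small
```

The same ordering that makes each update expensive is what keeps `lookup` cheap, which matches the observation that retrieval benefits from small, tidy index files.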

As a result of these discussions various elements
of the CAIRS software facilities were optimised.
The conversion process was recommenced and the
optimisation produced a 25% reduction in
processing time.

The conversion process proceeded satisfactorily
and as the transfer of the 1980 to 1987 records
progressed DRIC became less and less dependent
on the bureau services and more able to use the
new CAIRS facility.

Further problems were encountered as the
conversion process handled the 1970 to 1979 data.
Due to the age of the database and changes in
requirement, it had been necessary to introduce
changes to the data structures at the bureau.
Unfortunately these changes had not been made
retrospective nor had they been documented. The
result was that every so often the process would
fail.

In these circumstances, it took varying amounts of
time to identify the bug and get LFRA to produce
suitably amended software to continue the
process.

Throughout the conversion process the Scientific
and IT staff used the LAN with no problems or
difficulties.
The functions supported by the LAN at this stage
were:


* NOTE: When the system was accepted in May
1987 the input of new document records was
transferred from the bureau to DRIC's new
computing facility. This allowed DRIC to conduct
Quality Assurance checks via the LAN.


3.5 Phase 2

Requirement

The Phase 2 requirement was to provide a
Document Movement Control System (DMCS).
This was needed to ensure safe custody of
classified documents and to manage the immense
task of keeping track of document movements.

Each year the number of document movements
into, through and out of DRIC are:

Receipts: 10,000 individual titles plus copies,
giving a total of 20,000 to 30,000 document
copies. Of the 10,000 titles received, 7,000 are
added to the DRIC database. This entails 5
separate movements for each document.

Distributions: 46,000 document copies are
distributed in accordance with originator's
instructions.

Requests: 11,000 documents are supplied on
request.

The total annual document movements is in the
region of 100,000. (See 'Document Movement
Patterns' diagram below.)

These movements were recorded manually after
the event. This meant that it was extremely
difficult to trace the exact location of an individual
document.
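
The annual total quoted above can be checked with a short calculation. The sketch below assumes, which the paper does not state, that each distributed copy and each supplied request counts as a single movement; the function and parameter names are illustrative.

```python
def annual_movements(titles_added=7_000, movements_per_title=5,
                     copies_distributed=46_000, documents_requested=11_000):
    """Itemise yearly document movements from the figures quoted above.

    Assumption: one movement per distributed copy and per request.
    """
    receipts = titles_added * movements_per_title
    total = receipts + copies_distributed + documents_requested
    return {"receipts": receipts,
            "distributions": copies_distributed,
            "requests": documents_requested,
            "total": total}


print(annual_movements())
```

Under these assumptions the itemised figures come to 92,000 movements, which is consistent with a total "in the region of 100,000"; the paper does not itemise the remainder.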


Document Movement Patterns

Receipts: Recording -> Abstracting -> Data Preparation -> Quality Assurance -> Stockroom

Distributions: Distribution -> Post room

Requests: Stockroom -> Requests

Design and development

During the design and development of the DMCS
there was constant consultation between the
DRIC IT section and the various user sections.
Additionally, as required by national agreements,
there was consultation between the IT section and
Trade Union representatives regarding the
introduction of new technology.

The additional terminal requirement was assessed
as 27 units. Their positioning was discussed with
the users and two options were considered:

First: cluster the terminals at convenient
locations throughout the work area. This
would allow staff to use an available terminal
as and when necessary.

Second: distribute the terminals between
groups of staff on the basis of 2 terminals to 3
staff.

The first option would require that:

special areas or rooms be set aside to
accommodate the terminals.

additional furniture be supplied.

staff check the availability of a terminal for
their use.

staff take their work, from their normal work
area, to the terminal.

It was decided that the second option would
provide the best environment for the staff and
would cause least disruption to the normal
workflow.

Bar coding was to be used to facilitate the
recording of document movements, and a need for
10 bar code readers and 2 bar code printers was
established.
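
How bar code scans make a document traceable can be shown with a minimal sketch. It does not describe the DMCS as built by DRIC's IT section; the class, its methods and the accession number format are hypothetical, and only the location names are taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class MovementLog:
    """Append-only record of bar code scans (hypothetical structure)."""
    events: list = field(default_factory=list)  # (sequence_no, accession_no, location)

    def scan(self, accession_no, location):
        """Record that a document's bar code was read at a location."""
        self.events.append((len(self.events), accession_no, location))

    def locate(self, accession_no):
        """Return the most recently recorded location, or None if never scanned."""
        for _, acc, location in reversed(self.events):
            if acc == accession_no:
                return location
        return None

    def history(self, accession_no):
        """Return every recorded location for a document, oldest first."""
        return [loc for _, acc, loc in self.events if acc == accession_no]


log = MovementLog()
log.scan("DOC-0001", "Recording")      # accession numbers are made up
log.scan("DOC-0002", "Recording")
log.scan("DOC-0001", "Abstracting")
```

Because each movement is captured at the moment of the scan rather than written up afterwards, the current location of any document is a single lookup.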


The development of the DMCS was progressed in
parallel with the procurement of the additional
hardware needed.
As each software module was produced it was
presented to the users for testing and acceptance.

All staff to be employed on the new DMCS
attended a 3 day computer appreciation course to
prepare them for the new challenge ahead and
DRIC acquired a software package designed to
develop and improve their keyboard skills.

Throughout the design and development of the
system the IT section followed MOD approved
project management and system development
methodologies.

Implementation

The additional hardware was delivered in March
1988 and the DMCS went live in April 1988.
During the next 6 months a variety of problems
were encountered; the majority of these were
easily rectified by software amendments.

However some presented greater problems. A few
examples are listed below.

Terminals wrongly sited

There was an immediate spate of complaints
concerning so called screen glare and poor
lighting conditions. The real problem was that,
despite advice on the subject, many users had
positioned their desks such that the terminal
screen was in direct sunlight and they did not
make use of the artificial lighting which had been
installed to the approved standards for use with
terminal screens.

The result was that screens reflected images
rather than gave off glare.

Unidentified functions and tasks

Throughout the first few months previously
unidentified functions and tasks were brought to
light, eg:

In carrying out its duties in accordance with
the DMCS requirement a section would
discover a need for a specific task which they
had not specified. On investigation it was
found that the task had previously been
carried out by some other section, but had not
been identified at the system development
stage.

There were two main reasons for this type of
omission:

When DRIC moved from St Mary Cray to
Glasgow in 1986 there was a complete
change of staff, from top to bottom,
resulting in a lack of continuity. The
knowledge base was therefore too shallow
to build a trouble free system.

There was also a lack of good quality
documentation of the existing manual system.

These two factors would be major impediments to
the development of any system.

A knowledge base can only gain depth through
time and experience and it is very difficult to
recognise when sufficient depth has been attained.

Documentation of the existing system would
normally contain details of how each function
should be carried out, together with workflow
diagrams and examples of any forms used. Again
the lack of continuity was a major factor.

To have delayed development of DMCS until a
satisfactory level of knowledge was achieved
would:

have deprived DRIC of early implementation
of much needed automation.

not necessarily have produced a trouble free
system.

Given sound knowledge, continuity of expertise
and good quality documentation of the manual
system, the Known unknowns cause few
problems. All that remains is to discover the
Unknown unknowns and eliminate them.

Poor  system  response  times 

Complaints were made by the user about
perceived poor response times. The IT section
argued that the user expectation had been set too
high through lack of knowledge of computing
systems, and that the LAN was performing
satisfactorily.

This, however, was not accepted by the user, who
demanded that the complaint be made direct to
the supplier otherwise they (the user) would revert
to a manual system.

The truth of the matter was not that the LAN
was slow but that, when the system was used by
a full user population, the screen definition and
software, which were originally accepted at user
trials, did not give a satisfactory level of response.

Redefinition of the screen format and amendment
to the associated software soon produced a
satisfactory situation.


3.6 Post Implementation Review

System Change Control (SCC) procedures had
been established in January 1988 following the
completion of the Phase 1 conversion process and
were used extensively throughout Phase 2
implementation to register all problems and
observations identified and record action taken to
rectify them.

A  Phase  2  post  implementation  audit,  as  such,  was 
not  carried  out  since  this  had  effectively  been  on 
going  under  the  SCC  procedures. 

A committee of representatives from the IT
section and the various user sections is
responsible for the application of the SCC
procedures, which are based on methodologies
approved by DGITS.

This committee meets once a month and allocates
priorities and sets completion dates for registered
changes. It is also responsible for monitoring
progress of changes in hand.

By mid 1989 the DMCS had settled down
sufficiently for users to undertake the task of
producing comprehensive documentation of the
clerical aspects of all tasks carried out by DRIC.

Each user section was tasked with producing
current working procedures. To help with this
task the IT section produced modified dataflow
diagrams for each known function within the
DMCS. The user sections then wrote the
procedures around these diagrams.

The final drafts of these procedures have been
checked and approved by senior management. As
changes occur the procedures are updated and
reviewed. Having such procedures available at
the specification and design stage saves valuable
time later.

3.7 Recent Activities

Having completed Phases 1 and 2 and in
accordance with the original assessments, the
magnetic disk capacity was increased from 1.8Gb
to 3.2Gb early in 1988.

As confidence in the system grew and the user's
knowledge increased, his requirements became
more demanding. It soon became clear that the
computer's Central Processor Unit (CPU) would
not have the power to sustain the new demands
being made.

A sizing exercise carried out in mid 1989 showed
that the CPU usage was regularly in the region of
95%.

An acceptable usage level is in the region of 75%
and so procurement action was initiated to
upgrade the DEC VAX 8250 to a DEC VAX 6310.

The 6310 machine offers 4 times the processing
power of the original 8200 computer installed in
May 1987. This is now the base of the current
configuration shown at 2.2 above.

All  user  sections  have,  where  necessary,  been 
supplied  with  extra  office  furniture,  including  foot 
rests  and  document  stands. 

3.8 Observations

Those  of  you  setting  out  on  a  similar  venture  may 
be  interested  in  the  following  observations; 

Good  quality  documentation  of  the  existing 
manual  system  is  vital  to  a  successful 
implementation. 

The  personal  needs  of  staff  must  be 
considered  throughout  development  and 
implementation. 

It  is  most  important  to  ensure  that  good 
relationships  exist  between  the  IT 
practitioners  and  the  user  at  all  times.  Easy 
to  say  but  difficult  to  achieve. 

User  expectations  and  ambitions  must  not  be 
allowed  to  rise  beyond  sensible  and  practical 
limits  but  how  do  you  stop  them! 

In  DRIC's  case  the  presence  of  an  in-house  IT 
section  was  invaluable.  The  use  of  outside 
consultants  is  fine,  but  when  problems  arise 
after  they  have  gone  who  picks  up  the  pieces? 

Finally it must be stressed that the technological
problems were insignificant compared to the
human aspects.


* British Crown Copyright 1991/MOD
Published with the permission of the Controller of
Her Britannic Majesty's Stationery Office
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1  Sommaire: 

Les  bases  de  donn£es  documentaires  accessibles  en 
ligne  sont  encore  utilisdes  par  un  nombre  trop 
restreint  d’utilisateurs.  Ce  phdnom^nt  est  du  ^  des 
causes  tr6s  diverses  comme  la  diflicultf  d’une 
manipulation  efiicace  des  langages  d’interrogation, 
I’hdtdrogdnditd  de  ces  langages,  la  diversity  des 
interlocuteurs  avec  qui  il  faut  passer  contrat  pour 
accdder  I’information,  la  lourdeur  des  procedures  de 
connexion,  les  couts,  la  difliculld  de  choisir  la  bonne 
base  de  donndes  et  le  bon  serveur  pour  rdsoudre  un 
problime  particulier. 

Le  problime  du  coflt  devrait  etre  risolu  avec  une 
augmentation  de  I’utilisation  des  bases,  cette 
augmentation  passant  par  le  traitement  des  autres 
points.  La  resolution  des  difTicultes  citees  d-dessus 
peut  etre  rdalisee  en  differents  points  de  la  chaine  de 
I’information.  On  peut  envisager  de  mettre  une 
certaine  intelligence  sur  le  poste  de  travail  de  la 
personne  qui  interroge  mais  ceia  I’oblige  e  disposer 
d’une  puissance  de  calcul  et  d’une  capadte  de 
stockage  importante,  cela  peut  etre,  aussi,  realise  sur 
le  serveur  lui  mSme  mais  cela  ne  fadlitera  I’accds  qu’k 
ses  propres  bases,  enfin  cela  peut  etre  fait  sur  un 
systeme  puissant  autonome:  le  gateway,  c’est 
probablement  i  ce  niveau  que  le  maximum  des 
problemes  peut  le  plus  fadlement  Stre  resolu.  C’est  la 
raison  pour  laquelle,  bien  qu’une  partie  importante  de 
ce  que  nous  allons  developper  puisse  Stre  mis  au 
compte  de  station  ou  de  pretraitement  sur  serveur, 
nous  avons  prdfere  nous  concentrer  sur  les 
fonctionnaliies  realisables  par  les  gateways. 

The aim here is not to list the functions offered by
existing gateways but to set out all those that
current technology makes feasible in the short term.

2.1  Contractual and accounting simplicity:

The very first benefit of a gateway is to simplify
contractual and accounting matters by providing a
single point of contact for all the databases one
wishes to access, whatever the host. One can even
envisage doing without an explicit contract, with a
service accessible by credit card number in countries
where that is permitted. The French idea of kiosk-type
access, where the charge for any service is included
in the general telephone bill, is a simple way of
easing access for a large public: no formality is
needed to reach the information at the moment it
becomes necessary. The drawback of this service is
that it presupposes an agreement with the host on a
special tariff.

2.2  Ease of connection:

Here too, having a single point of contact is a major
asset. The connection procedure is always the same and
can be recorded once and for all.

2.3  A single query language:

One of the great difficulties in accessing online
databases, even for professional information
specialists, is that each host has its own query
language. The various languages are often very
similar, but far from making the task easier this
leads to a great deal of confusion for those who
frequently switch hosts.

One of the major benefits of a gateway is to offer the
user a single query language for all hosts. The
gateway software translates the request from the
gateway's language into that of the host. This kind of
function would be unnecessary if hosts adopted a
common language, but it has to be acknowledged that
the attempt to generalize the use of the Common
Command Language (CCL) has not been a real success.
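
As an illustration of this translation step, here is a minimal sketch in Python. It assumes the gateway's common language uses the operators ET, OU and ADJ and the masking mark "?"; the two hosts and their syntaxes (`host_a`, `host_b`) are invented for the example and do not describe any real host.

```python
# Hypothetical operator tables. The host names and their syntaxes are
# invented for illustration; they do not describe any real host.
HOST_SYNTAX = {
    "host_a": {"ET": "AND", "OU": "OR", "ADJ": "ADJ", "MASK": "?"},
    "host_b": {"ET": "*", "OU": "+", "ADJ": "(W)", "MASK": "$"},
}


def translate(tokens, host):
    """Rewrite a tokenized gateway query in the syntax of one host."""
    syntax = HOST_SYNTAX[host]
    out = []
    for tok in tokens:
        if tok in ("ET", "OU", "ADJ"):
            out.append(syntax[tok])                 # operator of the common language
        elif tok.endswith("?"):
            out.append(tok[:-1] + syntax["MASK"])   # gateway masking mark
        else:
            out.append(tok)                         # plain search term
    return " ".join(out)


query = ["interface?", "ADJ", "interrogation", "ET", "base?"]
print(translate(query, "host_a"))
print(translate(query, "host_b"))
```

A table-driven design keeps each host's peculiarities in data rather than in code, so that adding a host would mean adding one table entry.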

Most gateways also allow the hosts' native language to
be used, for very experienced users.

The common languages offered by gateways are still
languages that have to be learned. The best common
language is undoubtedly natural language, and in
particular the language one speaks every day, even if
it is not the language in which the database is
written. We deal with these points under the "ease"
and "effectiveness" of searching.

2.4  Multi-database and multi-host queries:

Some gateways offer the possibility of running the
same query in parallel (or, at worst, one after
another) on several databases, even if they are on
different hosts. This kind of service is much easier
to implement on a gateway than on the user's
workstation. It implies that duplicates can then be
identified so that the user is presented, as far as
possible, with results free of redundancy. This
processing can be done equally well at the gateway or
on the user's workstation.
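
A minimal sketch of such duplicate detection, in Python, is given here under stated assumptions: records are dictionaries with `title`, `authors` (surname first) and `year` fields, and two records are treated as the same document when their normalized title, first author surname and year coincide. The field names, the matching rule and the sample records are all invented for illustration; real databases would need a more tolerant comparison.

```python
import re
import unicodedata


def record_key(record):
    """Build a comparison key from title, first author surname and year."""
    def norm(text):
        # Strip accents, lowercase, and keep only letters and digits.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in text if not unicodedata.combining(c))
        return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

    authors = record["authors"]
    surname = norm(authors[0]).split(" ")[0] if authors else ""
    return (norm(record["title"]), surname, record["year"])


def merge_results(result_sets):
    """Merge hits from several databases, keeping the first copy of each."""
    seen = {}
    for database, records in result_sets.items():
        for rec in records:
            key = record_key(rec)
            if key in seen:
                seen[key]["sources"].append(database)   # duplicate: note its source
            else:
                seen[key] = {**rec, "sources": [database]}
    return list(seen.values())


# Invented sample records, for illustration only.
results = {
    "INSPEC": [{"title": "Query Interfaces to Databases",
                "authors": ["Hoppe, U."], "year": 1989}],
    "PASCAL": [{"title": "Query interfaces to databases.",
                "authors": ["HOPPE U"], "year": 1989},
               {"title": "Gateways", "authors": ["Smith, J."], "year": 1990}],
}
for rec in merge_results(results):
    print(rec["title"], rec["sources"])
```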

2.5  User profile:

Serving a user well means being able to adapt the
dialogue to the user's knowledge. In general this kind
of adaptation is fairly elementary: a user may be an
"expert" or a "novice", and is offered longer or
shorter dialogues leading to the commands to be
executed. For example, the expert, who is assumed to
know the query language, can put the question
directly, whereas the novice answers a series of
questions that allow the system to build the query.

Much more powerful personalization tools can be
envisaged by drawing on recent work in the field of
"intelligent" computer-assisted instruction (EIAO,
Enseignement "Intelligemment" Assisté par
Ordinateur) [1]. The user's personality, field of
knowledge, areas of interest and familiarity or
otherwise with the vocabulary of a field can be
observed by the help system as the dialogue with the
query system unfolds. The help system, which knows the
user better and better, can then offer suitably
adapted assistance.

2.6  Choice of databases:

The criteria for choosing databases are of two kinds.
First, the databases likely to answer the problem at
hand must be found. Next, selection criteria are
needed based on the search costs or on the coverage of
the databases, which may be available on several
hosts.

The hardest problem is choosing databases on the basis
of their content. The simplest solution to implement
is a tree-structured choice of scientific fields,
which leads users to locate their problem within the
full set of possible fields. This approach does not
allow a very fine choice, because such a tree cannot
be very large without becoming difficult to explore.

Another approach is to build a database of databases.
Each record describes the content of one database in
the form of keywords, free vocabulary or an abstract.
It also contains, of course, the list of hosts
offering the database, and possibly the number of
documents and the list of fields.

This database can be searched in the conventional way,
by describing the search field as a Boolean function
of words.

For systems that have natural-language querying with a
reformulation module, the question that will be
submitted to the chosen database can be put directly
to the database of databases.

To do so, after a linguistic analysis of the question,
the vocabulary is normalized and automatic
reformulation rules of the specific-term-to-generic-
term type are used to generalize the question and make
it easier to match against the database descriptions,
which cannot be very detailed in every field. Some
general terms are produced from several terms of the
question, confirming that they characterize the
requested field fairly well. A semantic proximity
calculation between the generalized question and the
database descriptions is then performed, so that
solutions can be proposed in decreasing order of
relevance.

The reformulation rules that allow generalization can
largely be built automatically from thesauri where
these exist.

Example (MERIBEL prototype [2], built on SPIRIT):

Quelles sont les références sur la microanalyse
d'échantillons géologiques: application à l'uranium.
(Which references deal with the microanalysis of
geological samples: application to uranium.)

Proposed databases (in decreasing order of relevance)
after comparison with the database of databases:

EDF-DOC, INIS, INSPEC, PASCAL
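
The selection procedure described above can be sketched as follows in Python. This is only an illustration under assumptions: the generalization rules, the database names and their keyword descriptions are invented, and a simple weighted overlap stands in for the semantic proximity calculation, whose details the text does not give.

```python
from collections import Counter

# Hypothetical specific -> generic rules, as might be derived from a thesaurus.
GENERIC = {
    "microanalyse": ["analyse chimique"],
    "uranium": ["matiere nucleaire", "metal"],
    "echantillon geologique": ["geologie"],
}

# Hypothetical "database of databases": each base is described by keywords.
BASE_DESCRIPTIONS = {
    "BASE_NUCLEAIRE": {"matiere nucleaire", "reacteur", "analyse chimique"},
    "BASE_GEOLOGIE": {"geologie", "mineralogie"},
    "BASE_MEDECINE": {"pharmacologie", "clinique"},
}


def generalize(terms):
    """Add the generic terms of each question term; a generic term reached
    from several question terms receives a higher weight."""
    weights = Counter(terms)
    for term in terms:
        for generic in GENERIC.get(term, []):
            weights[generic] += 1
    return weights


def rank_bases(terms):
    """Score each base by the summed weight of the terms it shares with
    the generalized question, and return the bases in decreasing order."""
    weights = generalize(terms)
    scores = {
        base: sum(weights[k] for k in keywords if k in weights)
        for base, keywords in BASE_DESCRIPTIONS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(base, score) for base, score in ranked if score > 0]


print(rank_bases(["microanalyse", "echantillon geologique", "uranium"]))
```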

Once a database has been chosen, and then the host if
there are several, the natural-language question,
whose linguistic analysis has already been done, can
be translated directly into a query suited to the
host.

2.7  Ease of use (the different modes):

One of the fundamental aims of gateways is to allow
searching that is simpler, more user-friendly and if
possible more effective than what each host provides.

To ease access to information for novice or occasional
users, two approaches are possible.

In the first, where the computer is in control, the
user is offered a succession of choices that lead to a
precise statement of what is wanted. This mode copes
well with the user's lack of knowledge of the subject,
but it can make access to the information very slow
because of the very large number of interactions. This
approach is the simplest to implement.

In the other approach, where the user is in control,
the question can be put directly. Since the user is
assumed to be a novice or occasional user, the
simplest question to formulate is one in natural
language. This raises the problem of search
effectiveness: all the effort has to be made by the
computer, which must interpret the natural-language
question, extend it by reformulation to other wordings
of the same themes, and finally build a search
strategy in the particular language of the chosen
host.

Natural-language querying of bibliographic databases
poses particular problems in that it mixes search
criteria bearing on factual information (author,
publication date, publisher, etc.) with textual search
criteria bearing on the actual content of the
documents.

These two parts must therefore be clearly separated,
because the first has to be handled in Boolean
fashion, whereas the second will be handled either by
a semantic proximity calculation or by a strategy of
several Boolean questions that then have to be
combined.

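Example:

Un article publié par Hoppe depuis plus de 2 ans sur
les interfaces d'interrogation aux bases de données
documentaires. (An article published by Hoppe more
than 2 years ago on query interfaces to bibliographic
databases.)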

It is clear here that the first part of the question
is factual. 'article' simply indicates the nature of
the object sought and plays no part here, since this
is a database of articles. 'publié par' ('published
by') makes it possible to identify Hoppe as the author
of the article. This can be done if the relations
between the items contained in the database (author,
article, publishers) are fully described, both
semantically and lexically, together with their link
to the fields of the database. The possibility, in
some retrieval systems, of asking in which field a
word occurs makes it easier to interpret the question
when the syntax leaves ambiguities.

'publié depuis plus de 2 ans' ('published more than 2
years ago') indicates that the publication date must
be earlier than today's date minus 2 years.

The word 'sur' ('on') generally introduces the theme
to be searched, but it must be checked that what
follows cannot be interpreted as a factual criterion.

The result of the analysis will therefore be a Boolean
question bearing on specific fields:

Example:

AUTEUR=Hoppe ET DATE<=891007
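
The separation of the factual part from the theme can be sketched in Python as below. This is a deliberately naive illustration: it recognizes only three hard-coded French patterns with regular expressions, whereas the text relies on a full linguistic analysis. The function name `split_question` and the reference date of 7 October 1991 are assumptions chosen so that the sketch reproduces the value shown in the example.

```python
import re
from datetime import date


def split_question(question, today):
    """Separate factual criteria from the theme of a French question.

    Only three hypothetical patterns are handled: 'publié par <name>',
    'depuis plus de <n> ans' and 'sur <theme>'.
    """
    clauses = []
    author = re.search(r"publié par (\w+)", question)
    if author:
        clauses.append(f"AUTEUR={author.group(1)}")
    age = re.search(r"depuis plus de (\d+) ans?", question)
    if age:
        # Latest admissible publication date: today minus n years.
        # (Assumes `today` is not 29 February.)
        limit = today.replace(year=today.year - int(age.group(1)))
        clauses.append(f"DATE<={limit:%y%m%d}")
    theme = re.search(r"\bsur (.+?)\.?$", question)
    return " ET ".join(clauses), theme.group(1) if theme else ""


question = ("Un article publié par Hoppe depuis plus de 2 ans sur les "
            "interfaces d'interrogation aux bases de données documentaires.")
factual, theme = split_question(question, date(1991, 10, 7))
print(factual)
print(theme)
```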

The theme, 'les interfaces d'interrogation aux bases
de données documentaires' ('query interfaces to
bibliographic databases'), has to be translated into
the host's query language. It is often difficult to
break down a complex natural-language question into a
single Boolean question. If the presence of all the
words is required by means of 'ET' (AND) operators,
there is a risk of getting no answer at all.
Conversely, using 'OU' (OR) operators will produce too
much noise.

The result of the linguistic processing can be used to
introduce operators such as adjacency, truncation or
'ET'.

In our example the following operators could be
obtained (ADJ means adjacency and ? means masking of
one character):

interface? ADJ interrogation ET base? ADJ donnée?
ADJ documentaire?

While this does ensure low noise, the silence (the
relevant documents that are missed) may be
considerable. It can be reduced by using synonymous
terms, which are substituted by means of OU for a word
of the question or for a group linked by adjacencies.

Example:

interface? ADJ interrogation ET ((base? ADJ
donnée?) OU système) ADJ (documentaire? OU
textuelle?)
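
The construction of such an expression can be sketched in Python. The sketch assumes that the linguistic analysis has already delivered the question as groups of adjacent terms, and that a synonym table is available; the helper names `or_group` and `build_query` and the data layout are hypothetical.

```python
def or_group(term, synonyms):
    """Return the term alone, or a parenthesized OU-group with its synonyms."""
    alternatives = synonyms.get(term, [])
    if not alternatives:
        return term
    # A multi-word term is parenthesized before being OR-ed with synonyms.
    head = f"({term})" if " " in term else term
    return "(" + " OU ".join([head] + alternatives) + ")"


def build_query(groups, synonyms):
    """Join the terms of each group with ADJ, then the groups with ET."""
    parts = [
        " ADJ ".join(or_group(term, synonyms) for term in group)
        for group in groups
    ]
    return " ET ".join(parts)


groups = [["interface?", "interrogation"],
          ["base? ADJ donnée?", "documentaire?"]]
synonyms = {"base? ADJ donnée?": ["système"],
            "documentaire?": ["textuelle?"]}

print(build_query(groups, {}))        # strict form, without synonyms
print(build_query(groups, synonyms))  # form widened with synonyms
```

With an empty synonym table the sketch yields the strict expression, and with the table it yields the widened one, which makes the trade-off between noise and silence easy to adjust.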

This method leads to questions that become more and
more complex, and one is soon led to turn the query
into a sequence of questions (a search strategy),
particularly where the search bears on abstracts or
even on full text.

The method that seems most promising is to move toward
systems offering weighted querying that returns a
ranked answer. The difficulty is that, in the case of
a gateway, the host can be accessed only through the
Boolean query language, which returns a number of
documents and not the list of identifiers of those
documents (the inverted list). This makes any
optimization of the process difficult. If the inverted
lists of each word of the question were available, it
would be possible, as in SPIRIT for example, to build
descriptors of document intersections, starting with
the most discriminating words, and to stop once one is
sure of having the most relevant documents, without
pursuing the search to the end. It then suffices to
group the documents into classes, each answering a
Boolean question derived from the original question,
in order to present a synthetic result. This method
has the advantage of giving results equivalent to a
search strategy that would ask every Boolean question
that can be formed combinatorially from the original
question.

Example of a SPIRIT query [3]:

Access to a full-text database (Spécifications
Techniques d'Utilisation du Minitel):

Question: effacement de l'écran de la position du
curseur à la fin de l'écran (clearing the screen from
the cursor position to the end of the screen)

Ranking of the answer documents:

class no.   no. of documents   intersection

1           1                  effacement-écran-position-curseur,fin-écran

2           1                  effacement-écran,position-curseur

3           1                  position-curseur,fin,écran

4           1                  effacement,écran,position,curseur,fin

The list stops there because the optimization criteria
consider that documents with smaller and less heavily
weighted intersections have no chance of answering the
question.

The classes of documents are given in decreasing order
of relevance. A hyphen between two words indicates a
dependency relation between them, but for our purposes
it can be treated as an adjacency. A comma corresponds
to a logical ET. There is, of course, no OU, because
this is an a posteriori characterization of the
intersection: if two words are present, they are
necessarily linked by an ET, even if the presence of
only one of them is needed for the document to be
considered relevant.
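
The grouping of answers into intersection classes can be illustrated by the Python sketch below. It assumes, unlike the gateway situation, that the terms of each document are available, and it ignores dependency relations between words (the hyphens in the example). The term weights and the sample documents are invented; this is not SPIRIT's actual algorithm.

```python
from collections import defaultdict


def intersection_classes(question_weights, documents):
    """Group documents by the set of question terms they contain, and
    order the classes by decreasing total weight of that set."""
    classes = defaultdict(list)
    for doc_id, doc_terms in documents.items():
        common = frozenset(t for t in question_weights if t in doc_terms)
        if common:
            classes[common].append(doc_id)
    ordered = sorted(
        classes.items(),
        key=lambda item: (-sum(question_weights[t] for t in item[0]),
                          sorted(item[0])),
    )
    return [(sorted(terms), sorted(ids)) for terms, ids in ordered]


# Invented weights: rarer (more discriminating) words weigh more.
weights = {"effacement": 2.0, "écran": 1.0, "curseur": 1.5}
docs = {
    "a": {"effacement", "écran", "curseur"},
    "b": {"écran", "curseur"},
    "c": {"écran", "curseur", "minitel"},
    "d": {"minitel"},
}
for terms, ids in intersection_classes(weights, docs):
    print(",".join(terms), ids)
```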


Since access to the inverted lists is not possible on
the hosts, one has to make do with an incomplete
strategy, or else face very long response times and
possibly very high costs. One can start from the most
restrictive question and relax it, either by moving
from a restrictive operator such as ADJ to a weaker
one such as ET, or by removing one of the words of the
question.
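
A minimal Python sketch of this progressive relaxation follows. The callback `count_hits`, which stands for sending a Boolean query to the host and reading back the number of documents, is hypothetical, as is the particular order in which words are dropped.

```python
def relaxed_queries(words):
    """Yield queries from the most restrictive to progressively weaker ones."""
    yield " ADJ ".join(words)              # all words adjacent
    yield " ET ".join(words)               # all words present, any position
    for i in range(len(words)):            # then drop one word at a time
        remaining = words[:i] + words[i + 1:]
        if remaining:
            yield " ET ".join(remaining)


def first_productive(words, count_hits, minimum=1):
    """Return the first (query, hits) pair reaching `minimum` hits, else None."""
    for query in relaxed_queries(words):
        hits = count_hits(query)
        if hits >= minimum:
            return query, hits
    return None


# Stand-in for a host: a fixed table of hit counts, invented for the example.
fake_host = {"écran ET curseur": 4}
print(first_productive(["effacement", "écran", "curseur"],
                       lambda q: fake_host.get(q, 0)))
```

Stopping at the first productive query keeps the number of requests sent to the host, and therefore the cost, as low as possible.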

As in manual search strategies, the words can also be
combined two by two, then three by three, guided by
the numbers of answers obtained from the successive
questions.

In every case it is worth setting up a weighting
strategy that makes it possible to rank
question-document intersections containing different
terms.

This is all the more true because, with a method that
allows any natural-language text to be treated as a
question, it is possible to take all or part of a
document displayed in answer to a question and submit
that text as a new query. This technique makes it
possible to create dynamic hypertext links over data
that were not structured for such use.

A final approach to information access by the end user
is exploration through concept graphs. This approach
can be regarded as intermediate between an entirely
guided approach and one in which the user has the
entire initiative. It consists in letting the user
navigate in a graph of terms linked by semantic
relations. This navigation allows the user both to
grasp the content of the database and to choose the
themes of interest, which will make up the search
equation.

The simplest way to implement such an approach is to
use the database's thesaurus, if there is one, as the
concept graph. An example is ESA's Hyperline. Such an
approach is worthwhile only if navigation is highly
interactive and very fast, which is rarely the case in
a dialogue with a gateway, which most often takes
place in character mode at 1200 baud. If such an
approach is really to be used, client-server
architectures should perhaps be envisaged, in which
the client software on the user's workstation handles
the whole dialogue, with a graphical representation of
the graph.


3  Effectiveness of searching:

We have just reviewed various ways of simplifying
access to information for the user. But how effective
are these methods? The linguistic tools and
reformulation systems that can now be built have
reached a level of reliability sufficient for their
use to bring not only convenience but also a real gain
in search effectiveness. We shall briefly review the
characteristics of these tools and indicate their role
in the search process.

3.1  Natural-language queries:

Unlike natural-language querying of database
management systems (DBMS), which can rely on knowledge
of the semantics of the database, querying of
bibliographic databases always deals with very broad
universes. This is why extensive use is made of levels
of analysis that do not depend on the domain, such as
the morphological or syntactic level, and why
semantics is limited to lexical semantics, that is,
semantic relations between words. The only place where
finer semantics is possible is in interpreting the
factual part of the question and in helping to
separate that factual part from the part bearing on
the content of the documents.

The purpose of automatic linguistic processing is:

-  to identify as the same word different character
strings (synonyms, different forms of an acronym or of
hyphenated compound words, forms derived from the same
word),

-  to resolve homographs as far as possible (the same
character string with different meanings depending on
the context), for example 'marche' as verb or noun,

-  to recognize compound words and, more generally,
dependency relations between terms,

-  to normalize the representation of words for
searching.

Linguistic processing also plays a part in
reformulation in that, by resolving certain
homographs, it blocks certain inferences that could
produce noise (e.g. 'poste' as a feminine noun -->
P.T.T.).

To conclude on the value of automatic natural-language
processing, it must be noted that, here again, the
need to search data that were indexed automatically by
systems without linguistic analysis is a handicap.
Linguistics-based search systems deliver their full
power only if documents and questions are analysed by
the same processing.

3.2  Reformulation problems:

Whether the database is indexed with a controlled
vocabulary or the abstract or full text is searched
directly, a user who expresses a question in natural
language is very likely not to use the terms contained
in the document. It is therefore necessary to produce,
from the initial formulation, all the formulations of
the same concepts that are possible in the language,
so as to retrieve all the relevant documents.

If the reformulation process is carried out
carelessly, it may have the drawback of producing a
great deal of noise. The whole difficulty of good
reformulation is to reduce silence as far as possible
without increasing noise too much. This is all the
easier when a weighted evaluation of the intersection
between question and documents is made. The weighting
system must ensure that, even if reformulation
produces many noisy documents, they are placed at the
bottom of the list of answer documents ranked in
decreasing order of relevance.
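
The role of the weighting can be illustrated with a short Python sketch. It assumes that each original question term has weight 1 and that each term introduced by reformulation has a lower weight; the `penalty` value of 0.5, the function names and the sample documents are arbitrary choices made for the illustration, not a scheme given in the text.

```python
def score(document_terms, weighted_question):
    """Sum the weights of the question terms found in the document."""
    return sum(w for term, w in weighted_question.items()
               if term in document_terms)


def rank(documents, original_terms, reformulations, penalty=0.5):
    """Rank documents by decreasing score; a term added by reformulation
    counts `penalty` times as much as an original term."""
    weighted = {term: 1.0 for term in original_terms}
    for term in original_terms:
        for alt in reformulations.get(term, []):
            weighted.setdefault(alt, penalty)
    scored = [(score(terms, weighted), doc_id)
              for doc_id, terms in documents.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ordered if s > 0]


# Invented documents, each reduced to its set of index terms.
documents = {
    "d1": {"gateway", "langage"},
    "d2": {"passerelle", "langage"},
    "d3": {"passerelle"},
    "d4": {"serveur"},
}
print(rank(documents, ["gateway", "langage"], {"gateway": ["passerelle"]}))
```

Documents matched only through reformulated terms are still retrieved, which reduces silence, but they sink toward the bottom of the list, which limits the effect of the noise.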

The linguistic data on which reformulation is based
may come from several sources.

One can distinguish knowledge of a general nature,
usable in any domain. Lists of synonyms can be used
for this.

As for lexical knowledge specific to a domain, one can
start from existing thesauri, but these have to be
transformed, because they often contain information of
no value to a system with linguistic processing. For
example, relations between words derived from the same
root can be handled automatically by the linguistic
processing (programme -TA-> programmeur), as can a
specificity relation between a word and a compound
word having that word as its head (indexation -TS->
indexation automatique).

It must be realized that semantic relations such as
synonyms, generic terms, specific terms and associated
terms are probably too coarse to allow very fine
reformulation, and that in future it will be necessary
to provide relations closer to those established by
dependency relations (for example agent-action,
action-object of the action, action-instrument of an
action, whole-part, kind of, etc.). Despite this
remark, the use of the relations conventional in
documentation already gives very worthwhile results.

So far we have been concerned with the use of lexical
relations created outside the databases. It can be
observed that part of the information needed for
reformulation is to be found in the database being
searched. This is what is used implicitly when part of
a relevant document is submitted as a new question
(translating the initial query into a vocabulary
closer to that of the database), or, as J.C. Bassano
does in the DIALECT system [4], when new words for the
query are inferred from the most relevant documents
returned by a first search. The inferred words are
those words of the relevant documents that are in a
dependency relation with the words of the question.

Finally, linguistic and statistical processing methods
make it possible, for large corpora, to build
automatically graphs of significant terms linked by
semantic relations. Although such processing is still
very imperfect, it considerably reduces the time
needed to produce a thesaurus or, more generally, a
graph of terms linked to a database.

This makes it possible to envisage searching by
concept graphs built from a graph that genuinely
reflects the content of the database, rather than the
nature of the field being searched, as would be the
case with a thesaurus.

Unfortunately, these latter tools require the whole
database to be available in order to build the graphs,
and this is rarely the case for gateways unless they
are themselves hosts.

3.3  Multilingual aspects:

Many databases are in English, but others exist in
Japanese, German, Spanish, etc. The ability to
understand a document in a foreign language does not
necessarily mean that one has sufficient command of it
to search effectively. The possibility of searching,
in one's mother tongue, databases written in other
languages therefore seems of great interest.

Multilingual searching is of interest even where the
user does not understand the language of the database
at all. Finding documents that appear relevant,
possibly checked by means of machine translation, even
rudimentary, allows a decision to commission a
translation to be taken with minimum risk.

This problem of multilingual searching is being
studied within the ESPRIT project EMIR [5]. The
project deals with searching textual databases, but it
could be adapted to introduce multilingual searching
into a gateway.

Multilingual searching is treated as a particular
problem of reformulation. Unlike what would happen
with machine translation of the question, multilingual
searching does not risk causing silence, and with a
good weighting system the noise should not be too
harmful.

With machine translation of the question, the system
is obliged to choose a single translation per word. If
the machine translation system makes a mistranslation,
the search system will set off in the wrong direction.

Multilingual reformulation, by contrast, will try to
search for all possible translations. This could cause
a great deal of noise, but experiments show that,
combined with a weighted evaluation of the
reformulation, the most relevant documents appear at
the head of the list, together, moreover, with the
correct translation of the words of the question. This
shows that a textual database in a given field can
serve as a knowledge base for choosing a good
translation of an ambiguous word in machine
translation.

Example of multilingual searching (on a full-text
database of nuclear regulations):

Question:

management of nuclear wastes

Multilingual reformulation rules used on the single
terms:

management --> maniement/direction/conduite/gérance/
gestion/exploitation/adresse/savoir-faire/
administration/direction

nuclear --> nucléaire

waste --> gaspillage/déperdition/détérioration/
dépérissement/freinte/déchet/débris/résidu/rebut/
déblais

Ranking of the answer documents in decreasing order of
relevance:

class no.   no. of documents   intersection

1           1                  gestion-déchets, nucléaire, résidu, débris, administration

2           3                  gestion-déchets, nucléaire, administration

3           5                  gestion-déchets, nucléaire
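
The principle of multilingual reformulation can be sketched in Python as below. The translation table is an abridged copy of the candidates listed in the example above; the documents are invented, and the score (the number of question words covered by at least one candidate translation) is a crude stand-in for the weighted evaluation, which the text does not specify.

```python
# Candidate translations taken from the example above (abridged).
TRANSLATIONS = {
    "management": ["maniement", "direction", "conduite", "gestion",
                   "administration"],
    "nuclear": ["nucléaire"],
    "waste": ["gaspillage", "déperdition", "déchet", "débris", "résidu",
              "rebut"],
}


def rank_and_disambiguate(question_words, documents):
    """Rank documents by the number of question words they cover through
    any candidate translation, and report the translations actually found."""
    results = []
    for doc_id, doc_terms in documents.items():
        used = {
            word: sorted(t for t in TRANSLATIONS.get(word, [])
                         if t in doc_terms)
            for word in question_words
        }
        covered = sum(1 for hits in used.values() if hits)
        if covered:
            results.append((covered, doc_id, used))
    results.sort(key=lambda r: (-r[0], r[1]))
    return results


# Invented documents, each reduced to a set of normalized terms.
documents = {
    "doc1": {"gestion", "déchet", "nucléaire"},
    "doc2": {"direction", "personnel"},
    "doc3": {"gaspillage", "énergie"},
}
for covered, doc_id, used in rank_and_disambiguate(
        ["management", "nuclear", "waste"], documents):
    print(doc_id, covered, used)
```

In this sketch the best-ranked document also shows which translation of each ambiguous word fits the field (here 'gestion' and 'déchet'), which is the disambiguation effect described above.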

To handle such an example, a syntactic analysis is
needed that makes it possible:

-  to make the translation conditional on the
syntactic role of the word ('arrêté' past participle
--> 'stopped'; 'arrêté' noun --> 'decree'),

-  to carry out syntactic transformations,
particularly within the noun phrase (nuclear waste -->
déchet nucléaire),

-  to translate expressions as a whole (pomme de terre
--> potato),

-  to achieve, through reformulation in the source
language, greater effectiveness in finding the right
translation,

-  to reduce silence still further through
reformulation in the target language.

It should be stressed that the delicate point is the
translation of expressions. Some expressions, limited
in number and regarded as idiomatic, can be
identified, entered in a dictionary, and thus
recognized when the question is analysed in the source
language. Given their limited number, these
expressions can also be entered in the transfer rules
from source language to target language.

As for the other expressions, they can often be
translated word for word, but this calls for a
combinatorial search in the lexicon of the database,
which can be costly. Unfortunately there are also many
examples that cannot be translated word for word (air
bag --> sac gonflable).

It is very costly to build large lexicons of
expressions with their translations entirely by hand.
One solution, which is to be tried out within the EMIR
project, is to build expression transfer dictionaries
from texts that have already been translated.

If no such translated texts are available, there
remains the option of processing monolingual texts to
pick out the significant expressions of the field by
means of linguistic and statistical processing, and
then translating them manually.


en  ennra  de 

The functions described above are those that current
technology allows. Gateways incorporating all or some
of these functions can therefore be envisaged in the
near future.

Gateways currently in service, such as Easynet [6][7],
Infotap and ESA, offer the basic services such as a
single contract whatever the host, a common query
language and choice of database.

ESA offers Hyperline, which is the very beginning of what could
become querying by concept graphs.

Some projects tackle the problem of query effectiveness in greater
depth and often include some linguistic processing.

Examples include:

-  The IMPACT 'MITI' project, which will allow multilingual querying
of seven European hosts in English, French, German and Spanish in
the field of technology and the environment. This work builds on
earlier research by the partners (the PLEXUS system of B. Vickery,
Tome Ass. [8]; EURISKO of Barthes and Glize, IRIT Toulouse [9];
EXPRESS of Ulrich Hoppe, GMD-IPSI Darmstadt [10]; Primus; LC-TOP;
and the electronic dictionaries of Softex).

-  The IMPACT 'CARTINFO' project, which will open services targeted
at small and medium-sized enterprises (PME-PMI) in different
countries of the Community. A market study determined the precise
needs of these companies, and the system will be able to answer
pre-identified queries. Emphasis has also been placed on rapid
delivery of the information resulting from the search, by electronic
mail, fax or post.

-  The TRIEL Anté-Serveur [11], developed in collaboration between
TRIEL and the University of Caen. This project tackles the
linguistic problems (syntactic and semantic) raised by the choice of
databases and the translation of questions.

Broadening the number of users who query online databases is
essential if this activity is to become genuinely profitable. This
can happen only if access is easy, effective and inexpensive.

Gateways have an important role to play in meeting this demand.
Although some of these problems could be solved elsewhere (on the
user's workstation or on the host), the gateway alone can simplify
the contractual aspect by offering a single agreement for a
diversified service. Parallel multi-host access is also a service
that only the gateway can offer, although one could, strictly
speaking, envisage a user workstation with multiple-connection
capability.


With the interfaces presented here, the aim is not to supply the
user directly with the answer to the question asked. The user is
offered a text or a set of texts. This results in interfaces that
try to put a whole range of tools at the user's disposal. These are,
first, the means for elaborate and user-friendly interactivity and,
second, intelligent and efficient techniques for progressively
matching "meaning" between the question and the set of documents.

To make tools cooperate whose effectiveness is, for the most part,
already proven, interfaces have recently moved from monolithic
designs to hybrid or "multi-expert" architectures. Research is now
moving towards "connectionist" implementations.


1.  User-friendly or intelligent interfaces:
from monolithic constructions to
hybrid multi-expert architectures


1.1.  Statement of the problem.

Our aim is to examine certain complex procedures involved in
interfaces to document retrieval systems. These procedures help give
the system a user-friendly, powerful and "intelligent" appearance in
use. It should first be stressed that apparently different
approaches, notably most discussions of user-friendly "natural
language" interfaces or of intelligent interfaces relying on
"artificial intelligence" techniques, also fall for the most part
within the field addressed here. A significant number of
applications now use "expert system" mechanisms to build
high-performance interfaces for document retrieval. Paice and Smith
[Paice C. 86; Smith 86] produced a first state of the art on these
themes. Fox, Vickery and Dachelet, for their part, identified
interesting work in bibliographic studies [Fox 87; Vickery 89;
Dachelet 90]. We focus this state of the art not only on identifying
"expert system" interfaces in document retrieval procedures, but
also on presenting the architecture used in these interfaces.

Natural language analysis and understanding usually occupy a
central position in all these interfaces, whether one is concerned
simply with analysing the user's question or, more ambitiously, with
a first step towards understanding the textual documents stored in
the databases. Techniques for representing meaning have notably
evolved from an essentially statistical approach towards
linguistic-conceptual approaches. From a linguistic approach, there
has thus been an imperceptible shift to an approach that belongs
more clearly to artificial intelligence: knowledge bases come into
play, made up of concepts between which various types of relations
with no linguistic status are established. Finally, there is growing
concern with improving the operation of the system and satisfying
the information needs of a particular user by taking that user's own
characteristics into account during a search. This draws on a
"cognitive search" approach.

So-called "expert" systems, which rely explicitly on rules, are
usually intended to solve complex problems efficiently within
restricted semantic universes. Large amounts of knowledge on precise
subjects are available for this purpose. This domain expertise,
corresponding to competence and know-how acquired by specialists,
must be able to be accepted and communicated in a flexible,
declarative form of rules. An inference mechanism exploits these
internal knowledge bases, generally expressed as rules, dynamically
and as well as possible in each particular case. An expert system
thus tends to capture the knowledge of one or more experts acting in
specialized fields. In our case, the expert is first of all the
human intermediary, the documentalist. The documentalist is above
all a generalist who uses skills and experience to guide and help
the information seeker. The expertise to be introduced into the
system is therefore centred first on knowledge of the tools and
techniques needed to locate and choose a database, then on mastery
of the procedures for manipulating the information stored in the
chosen database, and finally on the ability to understand the user's
request from general linguistic knowledge and to formulate it
appropriately.

But a deeper analysis of the meaning of documents and questions may
also be involved. The document collection covered by the databases
that the user addresses is very broad. Because of the encyclopedic
nature of the domain concerned, content-based search and the
transformation of the terms of the initial query remain a very
difficult problem. This problem can again be approached from general
linguistic knowledge, notably about paraphrase. It can also be
approached from the knowledge of a specialist in the domain. The
problem is then partially solved by building and using specific
knowledge bases (thesauri, semantic networks, etc.) that complement
the documentalist's general knowledge of the language.

This brings out, at the level of the documentalist as of the
analyst, perfectly defined sub-problems corresponding to specific
processing. One can also expect that the autonomous inference steps
under the system's control will always remain fewer than in more
traditional expert systems. The rules will generally be numerous and
sometimes ill-defined; they will be poorly consistent and often
redundant. It is therefore necessary to structure them and to have
meta-rules or strategic rules. Behind the theoretical advances of
recent years, one thus notes a move from attempts to implement
interfaces that were often monolithic to increasingly complex and
hybrid implementations based on so-called multi-expert system
architectures. This technical approach therefore appears to be an
effective way of solving the problems posed by document retrieval.
Specialist modules, whose activity is essentially based on
operations of "filtering", selection and propagation of very
specific knowledge, are themselves driven by modules that are expert
in strategy. These strategists have their own knowledge base, made
up of strategic and heuristic rules. They guide the operation of the
specialists by triggering, organizing and coordinating their work.


A distinction is thus introduced between user-friendly interfaces,
which essentially reflect the actions of the documentalist, and
intelligent interfaces, which reflect the analyst's capacity for
deep understanding. We place these two levels, of increasing
complexity in the architecture and power of the multi-expert systems
involved, in subsections 1.2 and 1.3 below. We return to their
essential aspects through the presentation of applications in
sections 2 and 3.

1.2.  So-called advanced or user-friendly interfaces: relatively
simple interfaces generally offering free-language dialogue
facilities

These interfaces bring together tools that can be added to a
document retrieval system, or possibly to a database management
system used for a documentary application, to obtain a user-friendly
environment that makes it easier to use. Research then focuses on
the ergonomics of interfaces. The objective is to make the user's
interaction with the document system more flexible and more
pleasant. Graphical interfaces and natural language interfaces
address this concern. These interfaces are often implemented in a
multi-database context. The user then has a more or less complete
set of complementary specific procedures. These procedures can be
implemented as specialists or experts that assist the user in
choosing databases and take charge of remote connection and
communication procedures [see for example MESSIDOR by Molinoux,
SCI-MATE by Stout, IN-SEARCH, SMARD by Sellami, MYRIADES by Bernard,
EURISKO by Barthes, ...]. In these interfaces, particular effort
goes into the user-friendliness of the exchange processes, through a
first elementary processing of the user's language. Dialogue
management is taken over by the interface. The user is helped to
respect the syntactic rules and the rules for transcoding into
specific query languages. These are ultimately advanced interfaces
that nevertheless remain relatively mechanical.

Accordingly, the interfaces taken up later in section 2 are those
that above all allow querying in natural language. In some cases the
queries may bear on the content of text fields stored in the
database. This linguistic processing remains relatively simple in
the context of document retrieval: analysis of the structure of the
query and identification of the nature of the information requested
(bibliographic data and/or content search); matching around the
forms identified in this initial query for a content search. Yet
this free-language query interface already requires, on its own,
techniques associated with multi-expert systems. Such natural
language analysis and understanding systems are now themselves
designed around a so-called "multi-expert" architecture and
generally use communication tools based on the "blackboard"
technique or on message passing.
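The "blackboard" technique mentioned above can be illustrated with a minimal sketch. It is an assumption-laden toy, not the architecture of any system cited in this paper: the three specialists (`tokenizer`, `stopword_filter`, `boolean_builder`), the `Blackboard` class and the `strategist` control loop are all hypothetical names. Specialists communicate only through the shared blackboard, and the strategist triggers whichever specialist is able to contribute.

```python
class Blackboard:
    """Shared workspace on which specialists read and write entries."""

    def __init__(self, query):
        self.data = {"query": query}


def tokenizer(bb):
    # Specialist 1: splits the raw query into lower-case word forms.
    if "tokens" not in bb.data:
        bb.data["tokens"] = bb.data["query"].lower().split()
        return True
    return False


def stopword_filter(bb):
    # Specialist 2: needs tokens; keeps only content words.
    stop = {"the", "of", "a", "in", "on", "by"}
    if "tokens" in bb.data and "terms" not in bb.data:
        bb.data["terms"] = [t for t in bb.data["tokens"] if t not in stop]
        return True
    return False


def boolean_builder(bb):
    # Specialist 3: needs terms; writes a Boolean search equation.
    if "terms" in bb.data and "equation" not in bb.data:
        bb.data["equation"] = " AND ".join(bb.data["terms"])
        return True
    return False


def strategist(bb, specialists):
    """Control module: on each pass, triggers the first specialist able to
    contribute, and stops when no specialist can add anything."""
    progress = True
    while progress:
        progress = any(spec(bb) for spec in specialists)
    return bb.data
```

Because each specialist checks the blackboard for its own preconditions, the order in which specialists are registered does not matter: `strategist(Blackboard("The evaluation of documentation systems"), [boolean_builder, stopword_filter, tokenizer])` still produces the equation `evaluation AND documentation AND systems`.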

1.3.  Intelligent interfaces based on elaborate representations of
domain knowledge (the database is, or is supplemented by, a
knowledge base).

A deeper analysis of the meaning of documents and questions is
needed to allow better matching. Some document retrieval systems
include, beyond the linguistic analysis phases already mentioned in
1.2, complementary phases aimed at building a semantic and/or
pragmatic representation of the utterances. These analysis
procedures lead to a structuring of the text collection.

A first approach then systematically builds and exploits semantic
and pragmatic representations derived from the information content
of the database itself. The text base can be regarded as an
assertional knowledge base. It is generally the paraphrasing
function that is chiefly invoked. Paraphrases generated by lexical
and/or linguistic transformation multiply the possibilities of
formal matching between queries and documents. Because they rely on
essentially linguistic knowledge, which is independent of the
application domain, these transformations remain general. These
paraphrasing techniques manipulate representations fairly close to
the linguistic surface and may not need to resort to a deeper,
conceptual level of knowledge representation.

A second approach, however, is based directly on a notion of
"meta-documents", which are constituted directly and often a priori,
without passing through a textual expression. These systems thus
rely on very elaborate "knowledge bases". In this case the
traditional, or assertional, document base is partly relegated to
the background. Preference is given to "terminological knowledge
bases" relating to the domains covered. This is the case of the
thesaurus, a tool forged for the overall representation of the base.
If translating the sentences of a text into utterances is difficult,
automatically extracting the representation of a sentence in the
terms of a "knowledge base" is just as difficult. In some
intermediate cases, midway between the two approaches, one can try
to describe the "meta-documents" by way of a textual expression. One
can also consider building this knowledge base from existing
encyclopedias.

In both approaches, the usual mechanisms of artificial intelligence
and expert systems must be adapted for use with vast document bases
of an encyclopedic nature: so far, the texts that artificial
intelligence has managed to process are short and belong to
restricted domains. The constraints specific to documentary
applications, which often address very broad domains of knowledge in
an atypical way, must therefore also be brought out. Very often,
through a dynamic process and a natural strategy of progressive
information seeking, the results of a first query are used
automatically to reformulate the request.

This classification has brought out interventions by (multi-)expert
systems that are increasingly complete and interwoven. Systems based
on "knowledge bases" probably use user-friendly interfaces and
necessarily perform linguistic analyses of documents (in addition to
analysing the question). Knowledge representation models are closely
linked to natural language understanding and processing. Hence all
systems that seek to match meaning generally also have natural
language interfaces. Some systems will therefore be cited as
examples several times, according to the precise aspects under
consideration: the user can indeed trigger increasingly complex
experts in these systems.

2.  User-friendly interfaces offering, in particular, natural
language querying: a first step towards multi-expert systems.

2.1.  A reference point: natural language querying of database
management systems.

These systems allow users to query, from a natural language
request, databases containing structured information.

Database management systems essentially manipulate structured data.
Some, however, allow character strings to be stored and have
primitives for specific manipulation of texts and documents.
Examples include [Croft et al. 1982; McLeod and Crawford 1983;
Miranda 1983; Bancilhon and Richard 1987]. Around such systems,
natural language interfaces can already be offered that perform
selections on the various fields and include functions for
displaying texts and documents. The systems of Normier, Hendrix and
Bobrow [Normier et al. 1985; Hendrix 1982; Bobrow and Bates 1983]
show in a general way the techniques used to build natural language
interfaces for conventional database management systems. The work on
Saphir and the extension of the relational Smart system proposed by
Fox illustrate more particularly the evolution towards rule-based
analysers [Normier B. et al. 1985; Fox 1981].

When SAPHIR is given the query "professeurs de maths célibataires
qui enseignent dans plusieurs établissements" (unmarried maths
teachers who teach in several institutions), it analyses it as "I
will give you the names of the institutions and of the teachers,
teachers in the subject mathematics whose marital status is single
and who hold a post in more than one institution", in a form that
paraphrases the underlying QBE or SQL query.

Even if some fields of the database contain text areas, procedures
for manipulating the terms of the texts are generally not available.
Yet these procedures are necessary for document retrieval.

2.2.  Adaptation to document retrieval

Other systems already allow, beyond simple text manipulation, a
search on content. This search corresponds to the matching of terms
called "descriptors". The information request, initially formulated
in natural language, generally leads to finding the solutions of a
Boolean or "extended Boolean" equation of descriptors. In this
elementary form of access to documents by their content,
representing the "meaning" of textual documents is tied to the
problem of automatic indexing. Automatic indexing procedures, more
or less powerful, are therefore also added to reduce the texts of
the base to a set of descriptors. DIALECT1 [Bassano 1986] was built
as an interface to the conventional database management system
ADABAS. This software package has appropriate characteristics for
setting up such an interface. When texts are inserted into the base,
an inverted file is built. Content words are identified, and
grammatical words that are too frequent and carry no semantic weight
are removed. This is essentially a matter of recognizing word forms
from delimiters and punctuation marks, and of eliminating terminal
sequences to normalize the forms.
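The indexing steps just described (splitting on delimiters, removing grammatical words, stripping terminal sequences, building an inverted file) can be sketched as follows. This is a minimal illustration, not the DIALECT1 or ADABAS implementation: the stop-word list, the suffix list and all function names are assumptions, and a real system would use far larger, language-specific resources.

```python
import re
from collections import defaultdict

# Hypothetical grammatical (stop) words and terminal sequences.
STOP_WORDS = {"des", "les", "de", "la", "le", "par", "un", "une", "et"}
SUFFIXES = ("es", "s", "e")


def normalize(word):
    """Lower-case a word form and strip at most one terminal sequence."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Split on delimiters and punctuation, drop stop words, normalize."""
    forms = re.findall(r"[^\W\d_]+", text.lower())
    return [normalize(f) for f in forms if f not in STOP_WORDS]


def build_inverted_file(documents):
    """Map each normalized term to the set of document ids containing it."""
    inverted = defaultdict(set)
    for doc_id, text in documents.items():
        for term in index_terms(text):
            inverted[term].add(doc_id)
    return inverted


def search_and(inverted, query):
    """Solve a Boolean AND equation over the descriptors of the query."""
    terms = index_terms(query)
    if not terms:
        return set()
    result = set(inverted.get(terms[0], set()))
    for term in terms[1:]:
        result &= inverted.get(term, set())
    return result
```

With this crude normalization, "systèmes" and "système" reduce to the same form, so a query in the singular retrieves a document indexed in the plural.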

For example, for the query "recherche des documents traitant des
systèmes documentaires, écrits par Bassano" (find documents dealing
with document retrieval systems, written by Bassano), DIALECT1
identifies the terms "système" and "documentaire" for the descriptor
field and links them with "ET" (AND), and picks out the term
"Bassano" for the author field. In case of failure, the system turns
some "ET" operators into "OU" (OR).

A new study is under way around the object-oriented software O2
[Altair project, Bancilhon et al. 1987].


But one can also use systems built from the outset specifically for
document retrieval. Such systems therefore already provide for
content search.

The IOTA system of Chiaramella's team [Defude 1984, etc.] offers a
natural language interface to a bibliographic retrieval system.
Analysis of the query makes it possible to pass from a syntactically
rich and therefore potentially ambiguous form to the simple,
unambiguous form of the Boolean search equation. The analysis is
centred on the recognition of noun groups; the words used in the
query are known to the system. The various concepts and the logical
operators linking them are identified, and each concept is
associated with the corresponding terms of the indexing language.
The ALEXIS system from the company Erli and the system of Gauch
[Gauch 1988] are other examples of this type of interface to
specific document systems.

From the question "mobilier contemporain en bois massif"
(contemporary solid wood furniture), ALEXIS, DIALECT or IOTA first
identify the phrases "mobilier contemporain" and "meuble en bois
massif".

The SPIRIT system uses, for automatic indexing, powerful syntactic
and lexical tools that also allow it to identify valid phrasal
units. When searching textual data, it applies a complex matching
procedure. The principle is, on the one hand, to assign weights to
terms using probabilistic techniques and, on the other hand, to
compute distances between queries and documents.

From the question "un collier en dent de requin" (a shark-tooth
necklace), SPIRIT identifies the empty words (un, en, de) and the
content words (collier, dent, requin). It also detects the phrase
"dent-requin". The class of most relevant documents contains texts
that have the terms "dent-requin" and "collier". The next class
contains only documents retrieved by the phrase "dent-requin". The
documents of the third class have only the term "collier", and so
on.
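The grouping of answers into classes of decreasing relevance can be sketched as below. This is only an illustration under stated assumptions, not SPIRIT's actual matching procedure: the weights in `query_units` are invented (the phrase is simply given more weight than the single word), and documents are assumed to be already reduced to sets of indexing units.

```python
from collections import defaultdict


def rank_by_classes(query_units, documents):
    """Group documents by the set of query units they contain and order the
    classes by decreasing total weight of the matched units.

    query_units: dict mapping each query unit to a (hypothetical) weight.
    documents:   dict mapping a document id to its set of indexing units.
    """
    classes = defaultdict(list)
    for doc_id, units in documents.items():
        matched = frozenset(u for u in query_units if u in units)
        if matched:  # documents sharing nothing with the query are dropped
            classes[matched].append(doc_id)

    def weight(matched):
        return sum(query_units[u] for u in matched)

    ordered = sorted(classes.items(), key=lambda kv: weight(kv[0]), reverse=True)
    return [(sorted(matched), sorted(ids)) for matched, ids in ordered]
```

For the shark-tooth necklace example, with the assumed weights `{"dent-requin": 3.0, "collier": 1.0}`, documents containing both units form the first class, those with only "dent-requin" the second, and those with only "collier" the third.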


A new version of DIALECT, interfaced to the Spirit system, has been
proposed [DIALECT2 91, by Mekaouche].


The analysis of the initial query and the linguistic techniques
used for automatic indexing of texts are thus based on relatively
simple morphological and syntactic processing. This processing makes
it possible to extract phrasal units that are syntactically (and
semantically?) valid. A large part of the work of Rouault's CRISS
and of Bouché's team is also devoted to an undertaking of this kind.
The work of Lancel may also be consulted. Moves towards a
multi-expert architecture appear very clearly in all this work.
Particular attention should be paid to the rule-based analysers
[Rouault and Lallich; Marcus; Charniack; Rady; Fouqueré; etc.] that
are, or can be, used in these interfaces. Different specialists
manipulate specific linguistic knowledge: for example a specialist
for lexical entries, a specialist for homographs, specialists for
lexical units and a specialist for grammatical words [Mekaouche 90].
To these linguistic experts may be added the experts already
mentioned in 1.2 for the choice of databases and for communication
techniques.

3.  Intelligent interfaces: access to information when the database
is, or is supplemented by, a knowledge base.

3.1.  Use of paraphrasing and of the texts of the base as statements
of an assertional component.

3.1.1.  A reference point: classical question-answering systems
based on text bases (with or without inferential activity).

Among strictly linguistic approaches, let us mention the one
practised for nearly twenty years by the team of N. Sager [Sager N.
78 and the "Linguistic String Project"]. The texts are scientific
and technical texts belonging to a specific, homogeneous domain.
Linguistic processing makes it possible to convert the statements
into instances of a small number of statement schemas. These are
formalized as relational tables. The tables are then used for
various question-answering applications that ultimately rest on a
form of "querying of structured databases". The attested success of
the analysis procedure is tied to the semantic homogeneity of the
syntactic equivalence classes and to the recurrence of a small
number of canonical statement schemas. No inferential activity is
observed when the query is formulated.

Another approach, still clearly marked by linguistics but sometimes
using rules or representations of a semantic nature, may be cited
here. It proceeds essentially by lexical, morphological and
syntactic analyses. The syntactic representation may make it
possible to derive a semantic representation. The systems of D.
Kayser and his team are a good example. They work on the text of the
question and on text fragments found in the highly specialized
knowledge base to produce the answer to the question through
inferences at varying levels. In some cases one stays very close to
the surface structure of the texts. Reformulation and inference
mechanisms are clearly available.

To the question "est-il normal qu'un enfant de 6 mois ne sache pas
encore marcher?" (is it normal that a 6-month-old child cannot yet
walk?), the answer is "yes, with a plausibility of 90%, because we
found in the QUID base the statement: the child begins to walk alone
(at the age) of 12 to 18 months". This deduction used bidirectional
reasoning and created 17 nodes. Nine rules were triggered in
succession.

In the next subsection we show that adaptations of these techniques
give interesting results when the texts are long, numerous and bear
on non-specific domains. In concrete terms, text fragments resulting
from a first search are used to reformulate the query. The increase
in the precision of the results is often spectacular [see for
example Salton G. and Bassano J-C.].

3.1.2.  Adaptation to document retrieval.

The SPIRIT system of C. Fluhr and his team displays a set of texts
likely to answer a query; these texts are ranked in decreasing order
of relevance. The system allows the most relevant text to be used as
a new question submitted to the system. This is then a procedure with
reformulation by the most relevant document. In SPIRIT [Fluhr 85 and
91, Debili 88], it is therefore the user who selects, among the
retrieved texts, the parts that can serve as new starting points. A
dynamic link is thus established between texts or parts of texts,
a link very close to the notion of dynamic hypertext. But there is no
attempt at automatic, progressive and autonomous reformulation on the
part of the system.

The DIALECT system also uses the results of previous searches to
reformulate the query. It retains only the most relevant sentences
rather than a part of the document. DIALECT carries out this
operation automatically by applying linguistic analysis tools and
control procedures. The user's question, written in natural language,
is analysed and then used to extract a first core of highly relevant
"text zones". These zones are in turn analysed and exploited in order
to enrich the question. The system relies essentially on
distributional analysis procedures that detect formal syntactic
regularities. These reformulations are relaunched automatically until
a stopping condition is met.
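The loop described above (select the most relevant sentences, enrich the question, relaunch until a stopping condition holds) can be sketched as follows. This is only an illustration under stated assumptions: simple term overlap stands in for DIALECT's linguistic and distributional analysis, and the names `tokens`, `score`, `reformulate`, the stop-word list and the parameters `top_k`, `new_terms` and `max_rounds` are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny stop-word list replaces real linguistic analysis.
STOPWORDS = {"the", "of", "a", "an", "and", "or", "for", "in", "to", "is", "are", "on"}

def tokens(text):
    """Lower-case word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def score(sentence, query_terms):
    """Number of distinct query terms present in the sentence."""
    return len(set(tokens(sentence)) & query_terms)

def reformulate(question, sentences, top_k=2, new_terms=2, max_rounds=5):
    """Enrich the query from the best sentences until nothing new is added."""
    query = set(tokens(question))
    for _ in range(max_rounds):
        ranked = sorted(sentences, key=lambda s: score(s, query), reverse=True)
        # Keep only sentences that share at least one term with the query.
        zone = [s for s in ranked[:top_k] if score(s, query) > 0]
        counts = Counter(t for s in zone for t in tokens(s) if t not in query)
        added = [t for t, _ in counts.most_common(new_terms)]
        if not added:  # stopping condition: the query is stable
            break
        query.update(added)
    return query
```

Starting from "evaluation of systems", a sentence about the evaluation of retrieval systems pulls in "retrieval" and "cost", which in the next round make a second sentence about cost and performance relevant as well; unrelated sentences never enter the query.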

Starting from a query on "the evaluation of documentation systems",
the reformulation proposes a transformed query that can be
paraphrased as "model or criterion for an evaluation of the cost,
effectiveness or performance of systems or programs; these systems or
programs concern automated documentation, document retrieval or
online bibliographic retrieval; (these may be) traditional evaluation
methods".

From the linguistic approach, we move imperceptibly to the artificial
intelligence approach by giving priority to the representation and
modelling of knowledge.


3.2. Use of a terminological component.

3.2.1. A reference point: classical question-answering systems based
on an elaborate knowledge representation

These systems use large knowledge bases built around specific
domains. Formalisms such as scenario grammars (frames), semantic
networks, scripts, etc. are used to represent the knowledge. Examples
include the systems of Schank's team or those of G. Sabah's team
[Vilnat 84].

A prototype system was proposed by Vilnat for querying the "yellow
pages" of the electronic directory. Starting from the question "I am
looking to have my car radio repaired", this system explains that it
is better to go through a garage, since the car radio is part of the
car. After asking for the requester's address, it proposes the
nearest garages.

To sidestep the problem of building these knowledge bases, some work
proposes using natural language, or rather a pseudo-natural language,
to express this knowledge [Wilks, Zarri for example]. Others seek to
build the knowledge bases automatically from encyclopedias. These
techniques remain very largely speculative.

Here again, interesting adaptations for document retrieval are
presented in the next paragraph.

3.2.2. Adaptation to document retrieval.

Instead of trying to deepen the linguistic analysis of a text, Tong's
RUBRIC system looks for clues, in the form of words, that allow it to
spot known concepts. This system is viable because the linguistic
analysis is relatively shallow and because the representation to be
obtained is already partially known from the knowledge base. Bosc's
HAVANE system is likewise able to extract from the text of a
classified advertisement a representation that can then be used to
answer a query issued by a user. Membrado's DOXIS system performs
conceptual indexing on reports written in medical language. The text
is summarised by a set of concepts. This gives the general sense of
the text, which can then be compared with a possible question. The
analysis of the texts is essentially semantic. The concepts and their
contexts of interpretation are known and recorded in a dictionary.
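The idea of spotting known concepts from word clues, then comparing a text and a question at the concept level, can be sketched as below. This is a minimal illustration, not the RUBRIC or DOXIS implementation: the dictionary `CONCEPT_CUES`, its entries and the function names are hypothetical, and real systems also record contexts of interpretation.

```python
import re

# Assumption: a toy dictionary mapping each concept to the words that signal it.
CONCEPT_CUES = {
    "brain tumour": {"astrocytoma", "glioma"},
    "imaging": {"arteriography", "scan"},
    "artery": {"carotid", "artery"},
}

def index_concepts(text, cues=CONCEPT_CUES):
    """Return the set of concepts whose cue words occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {concept for concept, triggers in cues.items() if words & triggers}

def matches(question, document):
    """Concepts shared by the question and the document."""
    return index_concepts(question) & index_concepts(document)
```

A question and a report that share no words can still match: "glioma scan" and "Parietal astrocytoma. Carotid arteriography." meet on the concepts "brain tumour" and "imaging".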

DOXIS summarises the text "Parietal astrocytoma. Carotid
arteriography (iodine), lateral view. Arterial phase. Avascular
expansive process. Draping by the anterior cerebral artery and the
branches of the sylvian artery." by the set of concepts: brain
tumour, imaging, neck (artery), image (technique used), appearance,
brain (artery), anatomy (detail), abnormal.

Other applications can also be cited here, such as the work of
S. Gauch and the Alexis system from ERLI. The latter system is
intended for managing complex dictionaries. It includes a query
analysis module that we have already mentioned. Supplemented by
special-purpose modules, it becomes a multi-expert system for
assistance with indexing and querying. It then reduces the user's
question to the notions present in the thesaurus.

In other words: it goes from the notion of "contemporary furnishings"
to that of "modern furniture", and from "solid-wood furnishings" to
the heading "wooden furniture". In another example, it retrieves the
notion of "salary increase", whether this appears as "growth of the
SMIC" (the minimum wage) or in "the recent rises in technicians'
pay...", etc.

In these systems, dictionaries organised as knowledge bases are used
to spot a certain number of concepts. These concepts are then used in
comparison procedures that generally involve only a single
inferential step. In a second category of systems, however, one
clearly observes the chaining of several inferential steps.


In the IOTA system, the terms of the original query are progressively
transformed during a reformulation process. The first step is to
choose the concepts to reformulate, then to determine the semantic
relations to use, and finally to fix the level of reformulation to be
carried out. A concept is progressively replaced by one or more other
concepts located by means of semantic relations. The links between
the concepts are modified in succession. An important point is then
to determine and control the distance allowed from the initial
concept. In this system, the distance is a function of the user's
level of knowledge. A realisation of the same type can be seen in the
EXPRIM system of David and Crehange.
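The three choices named above (which concept, which semantic relations, how many steps away from the initial concept) can be sketched as a bounded traversal of a thesaurus. This is an illustrative sketch only: the `THESAURUS` content, the relation labels and the function name are assumptions, and IOTA's actual control of distance from the user's level of knowledge is not modelled beyond the `max_steps` parameter.

```python
from collections import deque

# Assumption: a toy thesaurus; each entry lists (relation, related concept).
THESAURUS = {
    "paper": [("generic", "medium"), ("specific", "paper media")],
    "paper media": [("specific", "required paper media")],
    "medium": [("specific", "magnetic media")],
}

def reformulate_concept(concept, relations, max_steps, thesaurus=THESAURUS):
    """Return {concept: steps} for concepts reachable within max_steps,
    following only the allowed semantic relations."""
    reached = {concept: 0}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        if reached[current] == max_steps:
            continue  # the permitted distance from the initial concept is used up
        for relation, target in thesaurus.get(current, []):
            if relation in relations and target not in reached:
                reached[target] = reached[current] + 1
                queue.append(target)
    return reached
```

A larger `max_steps` could be granted to an expert user and a smaller one to a novice, which is one simple way to make the permitted distance depend on the user.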

In the IOTA system, the question "paper" is transformed into
"required paper media" by two reformulation steps. In the EXPRIM
system, "(photos of) children running, for the magazine Mode et
Travaux" becomes "clothing and leisure aspects of children" in two or
three steps.

The SPIRIT system [Fluhr and Radasoa], notably in its multilingual
version [EMIR, Fluhr 91], also uses a thesaurus as the knowledge base
of a specific domain. This is therefore a different aspect from
reformulation by relevant documents. In standard operation, the
comparator looks for intersections between the documents of the base
and the question by manipulating normalised words and using a
statistical model. In this case, the system's comparator also
exploits, starting from each term of the question, the various
possible relations: synonym, specific, generic, translation, etc.

Successive relaunches of a search are therefore accompanied by an
automatic reformulation of the initial query. This reformulation
invokes a thesaurus, or domain-specific knowledge base. But it also
relies on linguistic transformations of a morphological or
syntagmatic type, and on the use of semantic knowledge contained in a
general dictionary of the language.

The exact nature of this inference process depends on the formalism
adopted to represent knowledge. In the case of semantic networks or
thesauri, activation by proximity is generally used. The nodes and
arcs represent concepts of the domain. Once the starting nodes have
been activated, the activation propagates to other nodes by following
the established links and respecting certain constraints: distance
constraints, branching constraints for too large a number of arcs,
constraints favouring certain privileged paths according to
meta-knowledge, etc.
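Activation by proximity under the constraints just listed can be sketched as follows. This is a generic spreading-activation illustration, not the mechanism of any system cited here: the decay factor, the arc weights (standing in for privileged paths chosen from meta-knowledge), the parameter names and the example `GRAPH` are all assumptions.

```python
# Assumption: arcs carry a weight in [0, 1]; a higher weight marks a privileged path.
GRAPH = {
    "salary": {"wage": 1.0, "pay": 0.8},
    "wage": {"minimum wage": 1.0},
    "pay": {"salary": 1.0},
}

def spread_activation(graph, seeds, decay=0.5, max_distance=2, max_fanout=3):
    """Propagate activation from the seed nodes along weighted arcs."""
    activation = {node: 1.0 for node in seeds}
    frontier = list(seeds)
    for _ in range(max_distance):          # distance constraint
        next_frontier = []
        for node in frontier:
            neighbours = graph.get(node, {})
            if len(neighbours) > max_fanout:   # branching constraint
                continue
            for target, weight in neighbours.items():
                value = activation[node] * weight * decay
                if value > activation.get(target, 0.0):
                    activation[target] = value
                    next_frontier.append(target)
        frontier = next_frontier
    return activation
```

Concepts whose activation exceeds some threshold would then be candidates for enriching the query; the threshold itself is left out of the sketch.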

4. Conclusion: from an architecture based on multi-expert systems
towards a neo-connectionist architecture?

For some years now, user-friendly and intelligent systems have been
appearing. They draw on cognitive research and tend to reproduce the
whole behaviour of an expert documentalist. When searching for
information, they are guided by strategic knowledge. They take into
account a model of the user's behaviour. They generally have
natural-language interfaces that rely on linguistic analysis
procedures. They incorporate, in one way or another, the knowledge of
a domain specialist. They thus bring together, in a complex assembly,
different modules simulating the intervention of human
intermediaries: documentalist, linguist and analyst (knowledge
engineer).

In the United States, good representatives of this new generation of
systems are CODER by Fox and I3R by Croft. CODER uses experts for
building the user and query models, for analysis and indexing, for
choosing among different selection procedures and for using the
thesaurus. These experts communicate through a blackboard. I3R has
experts for building a model of the user, for establishing a model of
the query, for choosing the selection procedures (probabilistic
model, clustering techniques, navigation), and for inferring from the
knowledge base the concepts related to the initial query. A
controller governs the activation of the experts by using a plan and
an agenda.
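The blackboard organisation described above, with a controller that activates experts from an agenda, can be sketched as follows. This is a deliberately small illustration and not the CODER or I3R design: the three experts, their rules, the `PLAN` layout and the function names are hypothetical.

```python
class Blackboard(dict):
    """Shared working memory read and written by every expert."""

def user_model_expert(bb):
    # Assumption: a crude rule stands in for a real user model.
    bb["user_level"] = "novice" if len(bb["question"].split()) < 4 else "expert"

def query_model_expert(bb):
    bb["terms"] = bb["question"].lower().split()

def selection_expert(bb):
    bb["procedure"] = "ranking" if bb["user_level"] == "novice" else "boolean"

# Each plan entry: (expert, blackboard keys it needs, key it produces).
PLAN = [
    (selection_expert, {"user_level", "terms"}, "procedure"),
    (user_model_expert, {"question"}, "user_level"),
    (query_model_expert, {"question"}, "terms"),
]

def controller(bb, plan=PLAN):
    """Run every expert once, as soon as its inputs are on the blackboard."""
    agenda = list(plan)
    trace = []
    while agenda:
        ready = [entry for entry in agenda if entry[1] <= set(bb)]
        if not ready:
            break  # no expert can fire: stop rather than loop forever
        expert = ready[0][0]
        expert(bb)
        trace.append(expert.__name__)
        agenda = [entry for entry in agenda if entry[0] is not expert]
    return trace
```

The experts never call each other; they only read and write the blackboard, so the order of activation is decided entirely by the controller.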

In France, one can cite for example DIALECT [Bassano and Mekaouche],
SPIRIT [Fluhr and Radasoa] or IOTA [Chiaramella and Defude]. The
DIALECT interface comprises two strategist experts controlling seven
specialist experts. The first strategist multi-expert is responsible
for analysing the initial query and the texts. The second is
responsible for the progressive reformulation of the query. DIALECT2,
however, also has at its disposal the SPIRIT experts for indexing,
selection and reformulation. In a new version of SPIRIT, all the
knowledge about reformulation has been represented homogeneously by
production rules. These rules take into account both linguistic
knowledge (word families) and domain knowledge (thesaurus-type
rules). The firing of certain subsets of rules can be carried out by
means of meta-rules, which are under development.

In IOTA, the expert system used is a production-rule system that
allows external procedures to be called in the right-hand side of the
rules. Through its short-term database (blackboard), the expert
system manages communication between different components, which may
themselves be described as "experts": management of the text base,
management of a lexicon of the language and of linguistic analysis
rules, management of a thesaurus. The linguistic analysis procedures
themselves constitute a multi-expert assembly.

It can be seen that recent systems include an ever larger number of
experts. Traditional monolithic, procedural implementations are being
replaced by a set of "small" specialised experts. The operating
difficulties are shifted onto communication and the control of the
dialogue between the experts. It is therefore becoming increasingly
difficult to set up all these specialists and make them cooperate.
However, a new "neo-connectionist" current, exploring the simulation
of neural networks by machine, is already inspiring some research in
document retrieval. One interesting proposal rests on managing all
these specialists by a method analogous to that used in connectionist
architectures [Desrocques 90]. What is suggested is a kind of
hybridisation between multi-expert systems and neuro-mimetic systems.

From the multi-expert architecture, one borrows "experts" of very
limited size whose knowledge is supplied by specialists (linguists,
documentalists, etc.). From connectionist architectures, one retains
the capacity for learning and for the efficient handling of a large
number of features. The success of these implementations, built
partly around connectionist models, is probable: a definite
convergence links this approach to traditional statistical methods
and to the reformulation and inference techniques set out in this
state of the art.

SEARCH STRATEGIES IN NATURAL LANGUAGE

2509 LS The Hague, NL

Summary

After a discussion of online searching problems, some methods for
making online searching easy for end-users are described: intelligent
gateways, ZOOM, HYPERLINE, CD-ROM and WAIS. Attention is given to
parsing and natural language interfaces to databases, and then
natural language projects such as cm, OUPI, PLEZUS/ICHE/Kin,
D1UE60IH/IIL, DeiS/SnST, SFUaT/HfU and Upba DIIO are described. In
the fourth chapter attention is given to Natural Language and
Thesauri, such as the bilingual NATO Thesaurus. A bibliography has
been added. In an annex an example shows that End-users can start
online searching with Natural Language terms, using ZOOM and
HYPERLINE commands.


1. THE PROBLEM


The problems with online searching of (bibliographic) databases in
their native mode on the commercial and governmental database vendor
systems include:

- What relevant databases exist (which databases)

- How do I access them (which host) (SORM89)

- How do I retrieve information from them (which search terms)
  (which search strategy) (MISC87)

- What can I do with the retrieved information (postprocessing)
  (COTT88)


1.1 WHICH DATABASES


Each hostcomputer has a guide which gives information about the
databases that are available. Dialog has a Database Catalog, which
mentions 7 databases about Education within the category Social
Sciences and Humanities. But the Easynet Database Directory of
Telesystems mentions 22 databases about Education, although PsychINFO
and Social Scisearch are not mentioned. And Ted Brandhorst of the
ERIC Processing and Reference Facility mentions even more databases
about Education (BRAN90). The I'M Guide of the CEC IMPACT Program
lists 77 databases in the Category Education.


But when you have a question about a multidisciplinary subject such
as "Interactive Videodisc for Education and Training", some of these
Education databases will not have information about the subject, and
databases that are not mentioned may have valuable information. In
that case online Database directories such as Dialog DIALINDEX and
ESA QUESTINDEX may help.

DIALINDEX (SF engineering)

interactive(w)videodisc

    INSPEC            322
    NTIS               89
    COMPENDEX PLUS     71

These  database  directories  are 
hostspecific.  When  you  want  to  know 
everything  about  a  subject,  you  have  to 
use  all  the  database  directories  of  the 
hosts  to  which  you  have  access. 

The I'M Guide database of ECHO, which contains information about some
1500 European databases, isn't hostspecific, but is of limited use
for a multidisciplinary question. The printed I'M Guide only gives
information at the level of Dialog Bluesheets, which describe the
format of the databases. A better tool might be the Online Manual of
Blackwell Publishers, which contains a new and powerful keyword
Thesaurus, designed to enable any user to find the most useful
database to search in any given subject (COUS91).

But  for  a  multidisciplinary  question  a 
Current  Contents  based  Database 
Directory  (CCDD)  might  be  a  solution. 


1.2 WHICH HOST


In general you can get access to databases on hostcomputers if you
have a PC, a modem, communication software and contracts with several
hostcomputers. Unfortunately each hostcomputer has its own command
language and a complicated logon procedure (Table I).

CCL 

In 1979 the CEC developed a standardized common command language
(CCL) as a tool to improve human utilization of computer-based
information systems. Unfortunately the hostcomputers preferred to
continue using their own command languages.

It is very useful to have an autodial modem which can store the logon
procedure. Some hostcomputers have their own communication software,
such as Dialink of Dialog and Mikrotel of ESA. Some hostcomputers
have a gateway to another hostcomputer: when you have access to the
ESA hostcomputer, you can also search databases on the Profile
computer. But when you want to have access to databases on more than
3 hostcomputers, with different command languages, it might be useful
to have an (intelligent) gateway or intelligent interface.

(Intelligent) gateways can help untrained end-users and information
specialists to get access to bibliographic databases. Several
possibilities exist, which depend on the computer on which the
gateway is situated (hostcomputer, special gateway computer or PC of
the end-user) and on the degree of intelligence or help (autodial
telecommunication, command translation for different hostcomputers,
database and host selection, terminology and search strategy help,
search reformulation) (HILr.86) (BOOM90) (EFTH90) (Table II).

The Easynet system of Telebase, which gives easy access to some 900
scientific, technical and business databases on 12 hostcomputers, has
been a commercial success since 1984. In this case you don't have
problems with the command languages and the logon procedures of the
hostcomputers. You do not need contracts with hostcomputers; you only
get a bill from Easynet. Intelligent Information (II) of Infotap,
which was developed later in Europe, gives easy access to business
databases. Infotap is also developing TOOTSI, a tool-box for
developing intelligent gateways, which will be available in 1992
(HAHO90).

Although there isn't much information about how Easynet processes a
question, the Easynet system might be based on the ideas that were
developed in 1975-1980 by Richard Marcus (CONIT) and Victor Hampel
(TIS).


The ESURS system for automatic translation of command languages of
the Dialog, Datastar, FIZ-Technik and Genios hostcomputers is based
on lexical analysis, syntactic analysis (top-down parsing), semantic
analysis and a generating process (ZBOR91).

The major problem with intelligent interfaces is that, like the host
menu-systems, they offer insufficient assistance with term selection
and strategy construction. Simple searches may produce good results,
but complex searches will probably be less satisfactorily resolved.
Much will still depend upon the user's skill in selecting the right
terms and using them correctly (UiRGBO).


1.3 TERMINOLOGY AND SEARCH STRATEGY


When  you  want  to  retrieve  information 
about  a  certain  subject  from  a  database 
you  have  to  use  the  Keywords  from  the 
Controlled  Vocabulary  or  Systematic 
Thesaurus  of  the  Database.  That's  why 
an  Information  Centre  normally  has 
Thesauri  and  Classifications  of  all 
databases  that  are  used  regularly  : 
NASA,  INSPEC,  ERIC,  Psychinfo,  DTIC, 
TEST  (for  NATO-PCO  database). 

To  retrieve  information  about  a  complex 
subject,  the  query  has  to  be  analysed 
and  translated  into  keywords  that  have 
to  be  combined  in  a  search  strategy  or 
search profile (SORM89) (BATE89). An
overview  of  the  various  types  of  search 
strategies  has  been  given  by  Harter 
(HART86). 

BUILDING  BLOCK  STRATEGY 

This is the most commonly used overall approach. A search profile is
created through four steps:

- Identify major concepts or facets and their logical relationships

- Identify search strings that represent the concepts (words,
  phrases, descriptors, identifiers, classification codes) and fields
  to be searched

- Create a set for each concept or facet by combining the search
  strings of a concept using Boolean operator OR (union)

- Create a result set by combining the facet (concept) sets with
  Boolean AND (intersection) (NOT or OR rarely used)

An example of a search has been taken from a report about
"Interactive Videodisc Instruction" (FLET90).

Only DTIC, ERIC and Psychinfo databases were searched, using all
combinations of the following:

    Block 1      AND    Block 2      AND    Block 3

    Computer            Assisted            Education
       OR                  OR                  OR
    Videodisc           Aided               Learning
                        Mediated            Training
                        Managed
                        Based

Additionally, the following terms were used by themselves:

Interactive Videodisc, Interactive Video, Interactive Courseware
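The building block strategy can be sketched in a few lines: OR within each block, AND between blocks. The function names are hypothetical, the query syntax is generic rather than that of any particular host, and the `BLOCKS` list reflects one reading of the example above (Computer/Videodisc, Assisted/Aided/Mediated/Managed/Based, Education/Learning/Training).

```python
from itertools import product

# Assumption: the three blocks as read from the example search.
BLOCKS = [
    ["Computer", "Videodisc"],
    ["Assisted", "Aided", "Mediated", "Managed", "Based"],
    ["Education", "Learning", "Training"],
]

def build_query(blocks):
    """OR the terms inside each block, then AND the blocks together."""
    facets = ["(" + " OR ".join(terms) + ")" for terms in blocks]
    return " AND ".join(facets)

def expand_phrases(blocks):
    """List every combination of one term per block, e.g. for phrase searching."""
    return [" ".join(combo) for combo in product(*blocks)]
```

With these blocks `expand_phrases` yields 2 x 5 x 3 = 30 phrases, from "Computer Assisted Education" to "Videodisc Based Training", which is what "all combinations" amounts to.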

CITATION  INDEXING  STRATEGY 


This  strategy  is  only  applicable  in  the 
Science  Citation  Index  and  other 
databases based on citation indexes. The
simplest  strategy  is  to  identify  a 
"classic"  highly  relevant  paper  and  to 
retrieve  all  the  documents  that  have 
cited  this  document.  Also  the  name  of 
a  cited  author  or  the  names  of  cocited 
authors  can  be  used  as  a  search 
criterion. 


CONFERENCE  CITATION  INDEXING  STRATEGY 

When  you  can  find  a  good  conference 
about  a  certain  subject,  you  have  found 
key-authors  of  this  subject.  And  then 
you  can  search  all  relevant  databases 
using  the  names  of  these  key-authors. 
Cited  references  to  their  articles  can 
be  found  in  the  Science  Citation  Index. 


1.3.1 WHICH SEARCH TERMS


The Easynet Database Directory of Telesystems shows that 22 databases
have information about Education. When all available information
concerning a certain subject is needed for a new research project,
several of these 22 databases should be searched. Dialog Onesearch,
ESA Clustersearch or the Datastar Starsearch option can be used, each
of which allows multiple databases to be searched with one strategy.
Generally for Onesearch and Clustersearch long search strategies have
to be used, because each database has its own Thesaurus, and relevant
descriptors of all relevant databases should be included. A
descriptor of a certain database will be a free term (Natural
Language term) for another database (FIDE86). Alternatively, a
strategy translation step between systems is needed.

ZOOM  AND  HYPERLINE 

ESA-QUEST has a unique facility called ZOOM, which helps you to find
useful descriptors. It is based on statistical characterization of
terms associated with specified document sets. This is in general a
Frequency Analysis of terms in various fields of the retrieved
documents. ZOOM thus provides data for establishing statistical
relations among terms, and between documents and terms. The
information derived from ZOOM, in combination with general statistics
of the database, can be used for Probabilistic Relevance Feedback and
in other semi-automatic feedback modes (BELK90).

If the term the user is entering is not in the thesaurus, then a
sample of the documents containing that term is examined. The
controlled keywords (Descriptors or controlled terms) of the
documents in the sample are ranked according to their frequency in
the sample of documents using the ZOOM command (INGH84). Then you can
reformulate the question.
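The frequency analysis behind this kind of facility can be sketched as follows. This is not the ESA-QUEST implementation: the record layout (`text` and `descriptors` fields), the function name `zoom` and the default sample size are assumptions made for illustration.

```python
from collections import Counter

def zoom(records, term, sample_size=50, top=5):
    """Rank the controlled descriptors of a sample of records containing `term`."""
    # Take the first `sample_size` records whose free text contains the term.
    sample = [r for r in records if term in r["text"].lower()][:sample_size]
    counts = Counter(d for r in sample for d in r["descriptors"])
    return counts.most_common(top)
```

The highest-ranked descriptors are the controlled terms an end-user could then substitute for the free-text term when reformulating the question.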

ZOOM s interactive(w)videodisc

                     Inspec    Inspec    NTIS
                     abstr     tit       title
number of hits       252       163       35
ESA ZOOM on          50        50        35

Descriptor frequencies reported by ZOOM:

video and audiodisc: 37, 44
videodisks: 12
videorecording: 6
interactive systems: 31, 39, 4
interactive videodisc: 21, 15
interactive video: 7, 5, 21
interactive videodisc technology: 5

The ZOOM command has been used for the HYPERLINE searching aid, which
contains a Browse Thesaurus option (Semantic Association Method) and
a Navigation method. In the Browse Thesaurus option the top five
controlled keywords that are also Thesaurus entries are shown to the
user. Phrases (multiple terms) are accepted as input, but no real
Natural Language Processing (NLP) is performed on them. This means
that no verbs should be used and that (at least at this stage of the
project) only a simple preprocessing is performed. In this
preprocessing stage terms like "and" and "or" are treated as logical
ANDs or ORs and the term "in" is transformed into a logical "AND".
The kinds of phrases the user can input are then analogous to those
the user can search for via the CCL command "Find" (AGOS91).
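The simple preprocessing described here can be sketched as below. Only the three word-to-operator mappings come from the text; the function name, the handling of leading, trailing or doubled connectors, and the output format are assumptions.

```python
def preprocess(phrase):
    """Turn "and"/"or"/"in" into logical operators between term groups."""
    connectors = {"and": "AND", "in": "AND", "or": "OR"}
    parts, current = [], []
    for word in phrase.split():
        op = connectors.get(word.lower())
        if op and current:
            # Close the current term group and emit the operator.
            parts.extend([" ".join(current), op])
            current = []
        elif not op:
            current.append(word)
        # A connector with no term before it is silently ignored.
    if current:
        parts.append(" ".join(current))
    elif parts:
        parts.pop()  # drop a trailing operator with nothing after it
    return " ".join(parts)
```

For instance "standards in electronic publishing" becomes "standards AND electronic publishing"; no verbs or other sentence structure are interpreted.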

Search  term  : 

SGML 

HYPERLINE  BROWSE 

Related  Thesaurus  terms  : 

Electronic  Publishing 

Standards 

Word  Processing 

Desktop  Publishing 

Electronic  Data  Interchange 

Search  term  : 

Electronic  Data  Interchange 

HYPERLINE  NAVIGATION  : 

4994  :  Data  Handling 
993  :  EFTS 
5974  :  CAD/CAM 
763  :  Integrated  Software 
39674  :  Standards 
109  :  EDIF 

ZOOM and HYPERLINE have much to do with the quorum function that was
proposed by Cleverdon (CLEV84) and which was implemented on a trial
basis by ESA-IRS in 1985 as Questquorum (HART90).

BRS  TERM  AND  SUPERTHESAURUS 

But if all the cooperating systems used a common set of search keys
(descriptors) from a single generic (faceted) Thesaurus, the
translation step between systems could be avoided (VICK75).


The uncoordinated growth of databases is an impediment to online
searching. Standardization and coordination are required if users are
to be able to fully exploit the capabilities of online systems
(FIDE88). Standardization and improvement in Subject Access to
recorded information worldwide must be accomplished before Expert
systems can be truly useful for information retrieval (ALBE90).

The BRS hostcomputer has a TERM database, which contains merged
Thesauri from several fields (Education, Psychology, and Medicine,
among others), as well as Natural Language terms suggested by
practicing searchers. Even suggested Boolean combinations are
included. When a searcher inputs a term or phrase into the BRS TERM
database, an entry that may be up to several dozen lines long is
printed out, listing possible alternative terms.

If  this  database  were  expanded  and 
enriched  as  a  front-end  database,  or  a 
Superthesaurus,  it  could  contain  an 
enormous  variety  of  useful  entry  terms, 
with  all  sorts  of  guidance  to  decide  on 
the  best  terms  for  a  given  search 
(BATE89). 


1.3.2  WHICH  SEARCH  STRATEGY 

BOOLEAN OPERATORS

During online searching Boolean operators AND, OR, NOT are used to
combine the different aspects/concepts of the question
(postcoordination of descriptors). There is much comment on the
difficulty people have with Boolean operators in database querying
(THOM89), (ESSE91). Borgman found that 25 % of subjects learning an
SQL-like query language could not pass benchmark tests for system
proficiency, although these tests were representative of the searches
that were supported. The problem seemed to lie in the use of Boolean
logic: more than one quarter of the subjects could not complete
simple search tasks involving the use of one index and at most one
Boolean operator (BORG86).




One aspect of this problem might be the difference between natural
English usage of the words "and" and "or" and their use as Boolean
operators in retrieval. The English word "or" is most frequently used
to indicate union, but "and" is often used ambiguously to indicate
both union and intersection. The other aspect might be that the
subjects do not understand the use of Boolean operators in subset
specification.

It seems that Boolean operators are the biggest problem in End-User
online searching. Ulla de Stricker even doubts whether a menu
interface can be designed which will mediate between the "online
innocent" End-user and one or more online services (STRI88). But a
menu interface is used in Urbana to construct a Boolean search
strategy (MISC89).

Frants and Shapiro describe an algorithm that automatically
constructs a query. The algorithm uses a set of documents that the
user found pertinent to his information need. The descriptors which
were assigned to these documents are analyzed, and the most
important descriptors are identified. Then the Boolean query
formulation is drawn up, consisting of subrequests whose calculated
values all exceed a certain bound value. This bound value determines
the quality of the query formulation (FRAN91).
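
The general idea can be sketched as follows. This is not the Frants
and Shapiro algorithm itself: the scoring function (the fraction of
pertinent documents carrying a descriptor), the use of single
descriptors as subrequests and the name build_query are assumptions
made only for illustration.

```python
from collections import Counter


def build_query(pertinent_docs, bound):
    """Build an OR query from descriptors whose score exceeds `bound`.

    pertinent_docs: list of descriptor sets, one per document that the
    user marked as pertinent. The score used here (share of pertinent
    documents carrying the descriptor) is an assumed stand-in for the
    calculated values mentioned in the text.
    """
    counts = Counter(d for doc in pertinent_docs for d in doc)
    total = len(pertinent_docs)
    kept = sorted(d for d, c in counts.items() if c / total > bound)
    return " OR ".join(kept)


docs = [
    {"expert systems", "information retrieval", "thesauri"},
    {"expert systems", "information retrieval"},
    {"information retrieval", "cd-rom"},
]
print(build_query(docs, 0.5))  # expert systems OR information retrieval
```

Raising the bound keeps fewer descriptors and so narrows the query;
lowering it broadens the query.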

ACCRUE (SORTING) OPERATORS AND
QUERY-BY-EXAMPLE

Instead of the "difficult" Boolean operators, a more user-friendly
ACCRUE operator is used in Topic, a full-text retrieval software
package which is based on the RUBRIC concept-tree, a faceted cluster
of 50 terms (TONG85). The ACCRUE operator works as follows. Suppose
you want to search with 4 keywords A, B, C, D in a database of
abstracts which contain 4, 3, 2, 1 or 0 of the 4 keywords. With the
ACCRUE operator, titles of abstracts that contain all 4 keywords
come at the top of the list, then titles of abstracts that contain
3 keywords, and so on.

The titles having the same number of keywords in the abstract are
ranked according to an individual weight factor for each keyword.
The End-User does not have to think about Boolean operators: the
ACCRUE operator will put the most relevant titles at the top of the
list.
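
The ranking just described can be sketched as a two-level sort. The
code is an illustration under stated assumptions, not the Topic
implementation: the function name accrue_rank, the sample abstracts
and the weight values are all invented.

```python
def accrue_rank(abstracts, weights):
    """Rank titles by number of matching keywords, then by summed weight.

    abstracts: title -> abstract text; weights: keyword -> weight factor.
    """
    scored = []
    for title, text in abstracts.items():
        words = set(text.lower().split())
        hits = [k for k in weights if k in words]
        scored.append((len(hits), sum(weights[k] for k in hits), title))
    # More keywords first; ties broken by higher total weight, then title.
    scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
    return [title for _, _, title in scored]


weights = {"boolean": 3, "retrieval": 2, "ranking": 2, "interface": 1}
abstracts = {
    "T1": "boolean retrieval with ranking interface",
    "T2": "boolean retrieval",
    "T3": "ranking interface",
    "T4": "hypertext",
}
print(accrue_rank(abstracts, weights))  # ['T1', 'T2', 'T3', 'T4']
```

T2 and T3 both contain two keywords; T2 comes first because its
keywords carry the higher total weight.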

The ACCRUE operator can be used for Query-by-Example. In this case
you start a search with an abstract of a known, very relevant
article. You just click some 5 relevant words in the abstract and
then the ACCRUE operator will find similar articles, with the titles
of the most relevant articles at the top of the list.

RELEVANCE FEEDBACK AND WAIS

Instead of Boolean operators, a new Relevance Feedback concept has
been developed as a retrieval method for terabyte databases on the
Connection Machine supercomputer of Thinking Machines (KAHL86). The
Relevance Feedback concept has been used as DOWQUEST for the Dow
Jones News database on a CM2 machine. First a seed search is made by
entering all relevant words; then, say, the 10 most relevant
articles of 100 are marked, and then a search is made for "similar"
articles. The Relevance Feedback concept has also been used for Wide
Area Information Servers (WAIS). WAIS is a standard information
exchange protocol that offers unlimited connectivity and retrieval
functionality (KAHL91).
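
The seed search followed by a search for "similar" articles can be
sketched as below. The word-overlap scoring, the function names and
the four sample articles are assumptions for illustration; the text
does not describe how the Connection Machine systems actually score
similarity.

```python
from collections import Counter


def score(query_words, text):
    """Count occurrences of the query words in the text."""
    words = Counter(text.lower().split())
    return sum(words[w] for w in query_words)


def search(query_words, articles):
    """Return ids of matching articles, best score first."""
    ranked = sorted(articles, key=lambda a: (-score(query_words, articles[a]), a))
    return [a for a in ranked if score(query_words, articles[a]) > 0]


def feedback_search(marked, articles):
    """Search again, using all words of the marked articles as the query."""
    query = set()
    for a in marked:
        query |= set(articles[a].lower().split())
    return [a for a in search(query, articles) if a not in marked]


articles = {
    1: "merger talks between airline companies",
    2: "airline merger approved by regulators",
    3: "weather forecast for tomorrow",
    4: "regulators block airline takeover",
}
print(search({"merger"}, articles))    # seed search: [1, 2]
print(feedback_search([2], articles))  # after marking article 2: [1, 4]
```

Article 4 does not contain the seed word "merger" but is found in the
second step because it shares words with the marked article.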

PROXIMITY  OPERATORS  AND  RELATIONAL 
KEYWORDS 

Powerful proximity operators such as "adjacent", ADJ or (0W), can be
used for Natural Language searching for non-descriptors in Full-text
retrieval systems, for example: Intelligent ADJ Information ADJ
Retrieval.
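
A minimal sketch of what an adjacency operator tests is given below.
The function name adj_match and the whitespace tokenisation are
assumptions; real systems work on positional indexes rather than on
the raw text.

```python
def adj_match(phrase, text):
    """True if the words of `phrase` occur adjacent and in order in `text`."""
    terms = phrase.lower().split()
    words = text.lower().split()
    n = len(terms)
    return any(words[i:i + n] == terms for i in range(len(words) - n + 1))


print(adj_match("intelligent information retrieval",
                "An intelligent information retrieval interface"))  # True
print(adj_match("information retrieval",
                "retrieval of information"))                        # False
```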

Because words like with, of, from etc. are on a stop list, you
cannot use proximity operators for those words. In this case you can
use relational keywords (HOGI89), for example:

Software-development-wi-Computer-graphics
Specifications-of-Software
Aircraft-vs-Air-Defence

2. CD-ROM AND MENUS


In 1985 Philips and Sony published the "Yellow Book", a loose
standard for using the audio CD for digital data. This makes it
possible to store some 600 Mb of text information on a compact disc.
The disc thus became a Read-Only Memory which can hold some 200.000
pages of text.

In 1985 Silver Platter, which was founded by Bela Hatvany, presented
the first commercial CD-ROM offline database product. Since 1985
CD-ROM products have become a success, because they are generally
easier to use than online databases and because you have "unlimited"
access time once you have a CD-ROM of a certain database. In 1991
some 2000 commercial CD-ROM products exist.

The NTIS Ondisc of Dialog, the CD-ROM market leader, is a most
useful CD-ROM product in Aerospace R&D, because it contains much
information about Defense Research reports (AD-A nrs) and NASA
research reports (N nrs).

It has an Easy Menu mode, which is really user-friendly, and a
Dialog Command Mode. In the Easy Menu mode you first enter terms
(OR) and then you limit (AND).

A search strategy, even an Easy Menu strategy, can be saved and
executed online. This permits the End-User to search online. Several
Easy Menu Commands can be used in the Command Mode.

In the display mode of the Dialog Ondisc, sorting on Word Frequency
allows for Ranking of the "Best" on top, even when the search
strategy is clumsy. This is a useful option for untrained End-Users,
who can also use the Dialog Ondisc for training in Online searching.
Dialog produced their first CD-ROM, the ERIC Ondisc, in 1987 and now
offers 35 CD-ROM products.

But some other CD-ROM products also have useful features:

- NATO-PCO CD-ROM has a user-friendly Form Filling mode

- Silver Platter products have an F2 function key which you can use
to transfer words from the abstract to the Search function

- Wilsondisc of Applied Sciences and Technology has a rather good
Thesaurus

Unfortunately each CD-ROM vendor uses his own software. You can
therefore decide to standardize on 1 vendor who offers several
CD-ROM databases. In a big information centre with several
Information Specialists you can also decide that Information
Specialist 1 makes use of Dialog CD-ROMs and nr 2 makes use of
Silver Platter CD-ROMs. But in the case of a small Information
Centre with 1 Information Specialist for all kinds of databases you
would still need an intelligent interface.

In September 1991 Silver Platter announced that it would split its
SPIRS software into a Database Search Engine (server) and an
Interface (client), according to the recommendations of the NISO
Standards Committee. (This might allow users to search Silver
Platter CD-ROM products with the Dialog retrieval software, but
perhaps less information can then be stored.) Silver Platter is also
working on the Electronic Reference Library (ERL) for Macintosh and
Sun computers.

Standards  are  being  developed  for  : 

-  Information  Retrieval  Protocol 
(Z39.50) 

-  Structured  Full-Text  Query  Language 
(Aerospace  Industry) 

-  CD-ROM  Read-only  Data  Exchange 

In September 1991 Dialog announced the CD-ROM version of the
Bluesheets, which contains information about the format of the
online databases. The Bluesheets OnDisc can also be used as a simple
database directory.

But at this moment CD-ROMs do not have a good frequency analysis
such as the ZOOM command of ESA.

NATURAL  LANGUAGE  SEARCHING  ON  CD-ROM 

Because  titles  of  articles  contain 
Natural  Language  and  are  searchable  on 
a  CD-ROM,  you  can  start  a  search  with 
a  part  of  the  title  of  a  good  article. 

GLOBAL  CD-ROM  :  EDUCATION 

If intelligent gateways give easy access to databases all over the
world, a global information system will appear (KRAN89) (RICH89).

But instead of building intelligent gateways to all the databases
about Education, it might be easier to combine all the Education
information from all relevant databases on 1 single global Education
CD-ROM that contains at most 5 years and is updated yearly.

(The ERIC CD-ROM now covers at most 10 years.)

MENUS :

The creation of new business information services by CompuServe and
Dow Jones has been an important factor in the worldwide market for
business information. Their strategy was to target the End-User by
offering easy-to-use software and an emphasis on business news.
Dialog attempted to get into the game by offering the Business
Connection, an easy-to-use interface that stands between the user
and Dialog's more complex command-driven software (BREM91).

DMC 

The Dialog Medical Connection (DMC) gives menu-driven access to 30
databases such as Biosis, Medline, Scisearch and Excerpta Medica.
When 91 biologists and chemists of the Biology Directorate of Glaxo
Group Research Ltd got access to the DMC, the number of searches
increased 200 %. The average End-User carries out 4 searches per
month, spending 1 hour and some $ 60 (BOYD90). Of 4 alternatives
(Searcher, IT, Easynet with its high pricing structure, DMC), DMC
matches most of the requirements of the End-Users and is tailored to
the needs of biomedical searchers. DMC is an acceptable and
effective system for End-User searching for biologists and chemists
within Glaxo Group (BYSO90).

The success of DMC and CD-ROM Menus has been a driving force for
Host computers to make their databases also available through Menus.

Several Host computers now offer Menus for Easy online access, but
users might prefer a Common Menu Mode (CMM) for both CD-ROM Menus
and Online Menus.


3. EXPERT SYSTEMS AND NATURAL LANGUAGE


Expert Systems have great potential for Information Retrieval.
Expert Systems contain Knowledge Bases (Thesauri etc.) and Rules. If
Expert Systems are used to help End-Users, the rules which operate
on the Knowledge Base should be analogous to the decision rules or
work patterns of the Information Specialist (SHOV85). Expert systems
can also help intermediaries, but the interface needs of end-users
and intermediaries are very different (COTT87). Or: where should the
person stop and the Information Specialist search interface start
(BATE90)?

A truly intelligent computer interface would be one, then, which
would perform all of the functions that an intelligent, successful
human intermediary does, which are necessary and sufficient for
successful information retrieval system performance.

Furthermore,  it  would  be  necessary  for 
these  functions  to  be  performed  in  an 
appropriate  interactive  dialogue, 
perhaps  also  based  on  that  which  is 
performed  in  human-human  information 
interaction  (BELK86). 

Because  Du  Pont  has  discovered  that 
expert  systems  can  do  80  %  or  more  of 
the  decision  work  of  experts,  it  may  be 
similarly  expected  that  expert  systems 
could  locate  80  %  or  more  of  the 
information  required  by  an  interrogator 
(FEIG88). 

As much as 80 % of patron information needs will likely be satisfied
by the intelligent interface. The remaining 20 %, the difficult,
unusual, hard-to-pin-down questions, will be referred to human
intermediaries (KRAN89).

USER  MODEL 

Before you start an online search, you have to decide how much money
will be spent on the question of a requester. You have to know how
important the question is.

Will the requester be happy when he can select 5 relevant articles
from a list of 10 titles, or does he need to know everything from
the whole wide world concerning the subject of a new R&D contract?

3 types of online users are: Novice End-users (NE), Subject Experts
(SE) and Online Experts (OE). Other types are described in
literature about "User Behaviour" (BELK85) or "User Modeling".

Importance $           1.000    10.000   100.000
Search budget $           10       100     1.000
Document deliv $          10       100     1.000
Searcher                  NE        SE        OE
Training hours             1        10       100
Needed help
  which database           +         +         +
  which terms              +         +
  which strategy           +

This  might  mean  that  an  intelligent 
interface  for  80  %  of  the  questions 
should  mainly  help  Novice  End-users  (NE) 
and  Subject  Experts  (SE). 

In the case of the Novice End-User, the interface should be able to
accept questions in Natural Language such as "Do you have something
about XYZ", remove "do you have something about", split XYZ into
components according to the Building Block strategy etc.

In the case of an important question concerning a literature search
for a new research contract with a value of over $ 100.000, where
all available information is needed, a trained Online Information
Expert should search. He will use the Building Block Strategy,
cluster search etc. He will make a ZOOM on Journal names to find the
names of the Core Journals, to be able to read the latest issues
that are not yet searchable in databases. He will also make a ZOOM
on authors, to find the names of Core Authors (Table III). These
names can be used for searching the Science Citation Index. The
Online Expert will also search for information about Conferences and
Symposia (Table IV).

Although intelligent interfaces can help Novice End-users and
Subject Experts, I do not expect that intelligent interfaces will be
able to make and interpret ZOOM searches, searches in the Science
Citation Index, searches for Conferences and Symposia etc.


NATURAL LANGUAGE : I SAW HER DUCK


Natural language sometimes is not very easy to understand, because
some of the words in a sentence may have different meanings. Saw can
mean cutting with a saw or it can be the past tense of the verb see.
Duck can be a bird or a verb meaning to bend down. If the female
person lives in the neighbourhood of an airport, she might have
ducked because of the noise of a plane.

When you want to translate a sentence or answer a question, you must
know what the sentence means or what the requester means. If you
don't understand the sentence or the question, you have to ask a
second question to check the context. When the word "and" is used in
the question "Do you have something about aeroplanes and
meteorology", you have to know whether the meaning is:

- union (aeroplanes OR meteorology) or

- intersection (aeroplanes AND meteorology).

Major problems in understanding natural language are: ambiguity,
imprecision, incompleteness, inaccuracy (SCHA84).

PARSING 

Parsing is a process that is used during automatic translation
(ALBE90) and automatic indexing (EVAN91). During the parsing process
verbs, nouns and their logical order are recognised. This parsing
process is also employed to understand the meaning of a question
that will be processed in online database searching. It then has
similarities with the process of splitting up a question into
concepts or facets, which is employed in the online Building Block
Method. Several types of parsers exist:

- Context-free parsers (late 50s and early 60s), which attempt to
decompose a sentence by successively applying a series of
derivations such as:

1. sentence (S) = noun phrase (NP) + verb phrase (VP)

2. noun phrase = article (T) + noun (N)

3. verb phrase = verb (V) + noun phrase

Bottom-up parsing starts at the level of the individual words.
Bottom-up analysis uses knowledge about language and can produce
accurate, if only partial, results in arbitrary texts and texts that
contain unexpected information (JACO90).

Top-down parsing starts at the top of the tree. Top-down analysis is
much more tolerant of unknown words and grammatical lapses, but is
often fooled or misses information in unusual situations.

- Transformational parsers (late 60s) generate the deep structure of
a sentence, representing its syntactic and semantic interpretation.

- Augmented transition network (ATN) parsers (70s-) have the same
power as a transformational grammar but are more straightforward
(KORY90).
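
The three context-free derivations listed above (S = NP + VP,
NP = T + N, VP = V + NP) can be applied top-down by a small
recursive-descent parser. The sketch below is only an illustration:
the five-word lexicon and the function names are invented, and real
parsers handle far richer grammars.

```python
# Hypothetical lexicon: word -> category (T = article, N = noun, V = verb).
LEXICON = {"the": "T", "a": "T", "user": "N", "database": "N", "searches": "V"}


def parse_np(tokens, i):
    """Rule 2: NP = T + N. Return (tree, next index) or None."""
    if i + 1 < len(tokens) and tokens[i][1] == "T" and tokens[i + 1][1] == "N":
        return ("NP", tokens[i][0], tokens[i + 1][0]), i + 2
    return None


def parse_vp(tokens, i):
    """Rule 3: VP = V + NP. Return (tree, next index) or None."""
    if i < len(tokens) and tokens[i][1] == "V":
        result = parse_np(tokens, i + 1)
        if result:
            np, j = result
            return ("VP", tokens[i][0], np), j
    return None


def parse_sentence(sentence):
    """Rule 1: S = NP + VP. Return the parse tree, or None if no parse."""
    tokens = [(w, LEXICON.get(w)) for w in sentence.lower().split()]
    result = parse_np(tokens, 0)
    if result is None:
        return None
    np, i = result
    result = parse_vp(tokens, i)
    if result is None or result[1] != len(tokens):
        return None
    return ("S", np, result[0])


print(parse_sentence("The user searches a database"))
# ('S', ('NP', 'the', 'user'), ('VP', 'searches', ('NP', 'a', 'database')))
print(parse_sentence("user the searches"))  # None
```

Because this sketch starts from S and works downwards, an unknown
word or an unexpected word order simply yields no parse.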

Copestake and Sparck Jones propose an analyser which carries out
syntactic and semantic processing, based on a general purpose
grammar and a domain-dependent lexicon, and a translator which is
responsible for producing the database query. They use the term
analyser rather than parser, because parsing is frequently taken to
refer to syntactic processing alone (SPAR90).

Simple parsing yields a single parse tree, even if the sentence is
ambiguous and can be parsed several ways. At this stage the parsing
is purely syntactic. Sophisticated parsing yields a parse forest
composed of all parses that the grammar allows. Semantic analysis of
the parse forest will yield a most likely interpretation, which
becomes the interlingua interpretation.

An interlingua representation details the syntax of a sentence and
includes enough semantics to increase the likelihood of creating an
accurate synthesis. Elements of the representation are actually
coded as numbers that are indexes to multilingual dictionary entries
and phrase structure templates.

The Distributed Language Translation system being developed by BSO
Language Translation in the Netherlands uses Esperanto as its
Interlingua (BENT91).

NATURAL LANGUAGE ACCESS TO DATABASES


Let's consider an untrained user in a library who wants information
about Austria's clothing industry. In this case a search strategy
for Dialog might be:

ss (clothing or garment or apparel) and
(austria or austrian)


Because the untrained user is unaware of the need to parse a query
into component parts, he enters the question "clothing industry in
austria". If the host computer treats "in" as a search word, an
article "Austria sees clothing industry boom" would be missed
because there is no word "in" in the title. If "in" is recognized
and discarded as a trivial word, the words "austria", "clothing" and
"industry" can be combined by an automatic Boolean AND operator.
Neither is satisfactory, since the results of the search will not be
consistent with the intention of the user (STRI88).

In this case a Natural Language Interface (NLI) or automatic Natural
Language Processing (NLP) should translate the question into the
above search strategy. In general NLI or NLP translate the
"free-form" natural language expression of information needs and
queries by users into relevant system terms. NLP software may also
construct the system query and trigger the query operation. NLP
often exploits existing system indexes, thesauri and front-end
dictionaries or semantic networks when performing one or more of
these functions.
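
The translation of the Austrian example can be sketched as stop word
removal followed by synonym expansion. The stop list, the synonym
table and the function name are assumptions chosen so that the
sketch reproduces the Dialog strategy quoted above; in particular,
treating "industry" as a general word to be dropped is an assumption.

```python
# Hypothetical stop list and synonym table (illustration only).
STOP_WORDS = {"in", "of", "the", "industry"}
SYNONYMS = {
    "clothing": ["clothing", "garment", "apparel"],
    "austria": ["austria", "austrian"],
}


def to_dialog_query(question):
    """Turn a free-form question into a Dialog-style select statement."""
    blocks = []
    for word in question.lower().split():
        if word in STOP_WORDS:
            continue
        variants = SYNONYMS.get(word, [word])
        if len(variants) > 1:
            blocks.append("(" + " or ".join(variants) + ")")
        else:
            blocks.append(word)
    # One OR-block per concept, the blocks combined with AND.
    return "ss " + " and ".join(blocks)


print(to_dialog_query("clothing industry in austria"))
# ss (clothing or garment or apparel) and (austria or austrian)
```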

NLP  linguistic  analysis  techniques  have 
been  used  successfully  in  Online  Public 
Access  Catalogs  (OPAC's)  to  assist  with 
morphological  (variant  but  equivalent 
word  forms)  and  syntactical  (variant  but 
equivalent  phrases)  query-document 
matching  problems. 

For example, some routines compensate for word spelling or suffix
variations; in this case the stemming algorithm of Porter may be
used (PORT80). These algorithms were designed to conflate terms that
are morphologically similar. The assumption is that they will be
semantically close. Other routines match direct and inverted forms
of subject headings. In some systems these approaches are combined
to improve retrieval effectiveness (HILD89).

3.1 NATURAL LANGUAGE INTERFACES TO
STRUCTURED DATABASES

There is a considerable body of literature on Natural Language (NL)
interfaces to structured (mainly non-text) databases (FIN186)
(NOAC90). There have been a number of Conference Panels on Natural
Language Interfaces (NLI), such as at the 20th Annual Meeting of the
ACL, the 10th International Conference on Computational
Linguistics/22nd Annual Meeting of the ACL and the 25th Annual
Meeting of the ACL (NEIS87).

LADDER was developed by Hendrix in 1978, TQA by Damerau of IBM in
1980, IRUS by Bates in 1986, TEAM by Grosz in 1987 and JANUS in
1989. The TQA system is a NL front-end to databases, which
translates NL queries into SQL expressions, which are then evaluated
against the database. The Structured Query Language (SQL) is a
database querying language codified as a standard by the American
National Standards Institute. SQL is based on the Relational
modeling of data in database management systems (DBMS). These new
RDBMS systems (Oracle, Sybase, Ingres, Informix etc) are oriented to
distributed application on local area networks with client/server
architecture (CSA). Because querying directly with SQL is too
complicated for non-technical people, a user-friendly interface is
needed (LEIG89). EDA/SQL gives easy universal database access to
DB2, Oracle, dBase, even MUMPS (RICC91).

Commercial NLI systems which have been sold more generally include
INTELLECT (derived from Harris's ROBOT), which runs on IBM
mainframes and is used by the Library of Congress to access its
structured personnel files (MARN88), Q&A (developed by Hendrix
following the experience of LADDER) and Natural Language of Natural
Language Inc., Berkeley (SPAR90).

Natural Language of Natural Language Inc. can generate charts and
graphs and employs syntactic and semantic analysis of a query with a
Knowledge Base of over 100.000 English concepts and words.

SAPHIR from G3I ERLI is comparable with INTELLECT. GURU is a product
that supports the development of expert system interfaces to a
(relational) database management system. GURU also has capabilities
for storing and manipulating rules of a sort that make possible the
understanding of simple Natural Language queries (LEIG89).

INFOSTATION 

VTLS Inc of Blacksburg has developed new generation software on a
NeXT computer, the VTLS InfoStation, a multimedia access system for
library automation. The multimedia software of Intermedia allows for
Hypermedia links across documents containing text and graphics. A
unique feature of the InfoStation is its ability to understand
Natural Language queries. An Expert System interprets user
questions, sends requests to remote databases and even learns new
vocabulary (LEE90).

SCISOR 

SCISOR is a prototype Natural Language text processing system which
has been developed by Jacobs since 1986 for analysing stories about
corporate mergers and acquisitions from the Dow Jones database. The
design of SCISOR combines lexical analysis and Natural Language
analysis, knowledge representation and information retrieval
techniques. The Natural Language part consists of a bottom-up parser
TRUMP combined with a top-down parser TRUMPET.

A  topic  analyzer  looks  for  some  150 
prespecified  keywords  such  as  buy, 
merger  etc.  SCISOR  uses  a  general 
lexicon  of  about  10.000  word  roots,  with 
links  to  a  core  concept  hierarchy  of 
1.000  general  conceptual  categories. 

SCISOR has been ported to the domain of military messages as a part
of the MUCK-II project (JACO90).

NLH/E 

Because menus and hypertext systems are unsatisfactory in help
situations, Walter Tichy has built NLH/E, a Natural Language Help
system, which answers questions that are entered as typed Natural
Language sentences. NLH/E is built with a novel caseframe parser
that operates with a thesaurus, case inheritance, and noun/verb
phrase unification (TICH89).


3.2 NATURAL LANGUAGE INTERFACES TO
MULTIDISCIPLINARY DATABASES


When it comes to Natural Language Interfaces to multidisciplinary
bibliographic databases and ill-structured full-text databases,
other systems are generally mentioned in the literature, such as the
CITE, OKAPI, ALEXIS/DIANEGUIDE/I'M GUIDE, PLEXUS/TOME/MITI,
DGIS/STINET, SPIRIT/EMIR/DIALECT and Alpha DIDO projects.

I will first give attention to Online Public Access Catalogues
(OPACs) in libraries, because OPACs are used by end-users that form
a large population with extremely varied needs and backgrounds
(MITE89).

The initial OPAC was derived from circulation or cataloguing systems
and gives precoordinated phrase access to separate indexes (subject
headings, title, authors), which is appropriate for known item
search. The second-generation OPAC systems are more similar to the
commercial Information Retrieval systems; they provide keyword or
free-text access, corresponding to post-coordinate IR principles.
Their interfaces are generally command-driven, making use of Boolean
logic which is well suited for subject searching.

The third-generation OPACs combine the features of the first two
generations by providing both phrase (known item) searching and
keyword searching (MITE89).

3.2.1 CITE

CITE, the third-generation OPAC at the National Library of Medicine
(NLM), supports Natural Language queries and performs intelligent
stemming (truncation) on the user's search terms. Stemmed query
words are looked up in both free text indexes and the MeSH (Medical
Subject Headings) Thesaurus. Search words found in free text are
then automatically linked to associated MeSH descriptors.

The CITE retrieval methods include term weighting, combinatoric
searching, closest-match search strategy, relevance feedback
(dynamic user feedback), query expansion, and ranked document
output. Documents estimated to be most relevant are output at the
top of the list.

To begin a search, CITE invites the user to "Type your search
question". Significant words in the query (those not stoplisted) are
passed through a customized stemming algorithm which conflates
variant word forms, but also removes endings very common in medical
vocabulary (e.g. -itis, -ectomy).

Even without employing semantic aids such as synonym tables, or a
front-end dictionary or semantic network which would map potential
search terms to related MeSH descriptors, CITE achieves a
considerable degree of semantic query expansion using its automatic
and semi-automatic processes.


3.2.2 OKAPI


OKAPI is a prototype third-generation OPAC developed by a research
team (Mitev and Walker) at the Polytechnic of Central London.
OKAPI-84 supported Natural Language subject searching, with weighted
term, combinatoric retrieval. OKAPI-84 also employed search decision
tree-based rules to automatically change search strategy when one
attempt failed to produce retrieved items. Although the 1984 version
had flexible retrieval routines built in, matching on entry words
had to be nearly exact (known-item search).

OKAPI-86, a newer version, tested several linguistic computing
techniques aimed at improved term matching: automatic "WEAK" and
"STRONG" stemming (truncation) of search terms, automatic
cross-referencing and semi-automatic spelling correction. The
research team installed Porter's stemming algorithms in OKAPI-86.
The "WEAK" stemming routine normalized singular and plural noun
forms, and removed possessive endings, -ed's and -ing's. Spelling
variations (e.g. UK and US forms) were incorporated into the weak
stemming procedure. A stronger routine was tested but later
rejected.
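
What a "weak" stemming routine of this kind does can be sketched as
a handful of suffix rules. The sketch is deliberately crude and is
neither Porter's algorithm nor the OKAPI-86 code; the rules, the
length thresholds and the function name are assumptions for
illustration.

```python
def weak_stem(word):
    """Strip possessives, plural endings, -ing and -ed (crude sketch)."""
    w = word.lower()
    if w.endswith("'s") or w.endswith("s'"):
        w = w[:-2]                      # user's, users' -> user
    if len(w) > 4 and w.endswith("ies"):
        return w[:-3] + "y"             # libraries -> library
    if len(w) > 4 and w.endswith(("ses", "xes", "zes", "ches", "shes")):
        return w[:-2]                   # indexes -> index
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                   # catalogues -> catalogue
    if len(w) > 5 and w.endswith("ing"):
        return w[:-3]                   # searching -> search
    if len(w) > 4 and w.endswith("ed"):
        return w[:-2]                   # indexed -> index
    return w


for term in ["libraries", "indexes", "catalogues", "searching", "indexed", "user's"]:
    print(term, "->", weak_stem(term))
```

Rules this simple conflate many variants but also make mistakes on
irregular words, which is one reason a look-up table of exceptions
is useful alongside a stemmer.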

A measure of synonym control was attempted, specifically automatic
cross-referencing, using a look-up table which had 3 kinds of
entries: stop words, synonym pairs that would be adversely affected
by conflation (e.g. child and children), abbreviations and their
full expressions, words with alternative spellings (jail, gaol) and
equivalent word pairs suggested by query terms found in logged
searches (Great Britain, UK).
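
Such a look-up table can be sketched as a dictionary that maps each
variant to a preferred form, applied after stop word removal. The
table entries below are taken from the examples in the text; the
choice of preferred forms, the small stop list and the function name
are assumptions.

```python
# Hypothetical cross-reference table: variant -> preferred form.
CROSS_REFERENCES = {
    "children": "child",
    "gaol": "jail",
    "uk": "great britain",
}
STOP_WORDS = {"the", "of", "in"}


def cross_reference(query):
    """Drop stop words and replace variants by their preferred form."""
    terms = []
    for word in query.lower().split():
        if word in STOP_WORDS:
            continue
        terms.append(CROSS_REFERENCES.get(word, word))
    return terms


print(cross_reference("Children in the UK"))  # ['child', 'great britain']
```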

OKAPI-86 OPAC corrected about half of the spelling/miskeying errors
with favorable results. The cross-referencing always helped when it
was called into play. About 25 % of the searches studied contained a
word or phrase that matched one in the cross-references list. Very
clearly, OKAPI-86 has shown that subject retrieval performance can
be improved through the use of automatic linguistic term matching
aids (HILD89).

Several  functions  of  OKAPI  have  been 
integrated  in  the  LIBERTAS  system  of 
Swalcap  Library  Systems  (SLS). 

A comparison study of LIBERTAS and OKAPI was published in 1989
(HILD89). LIBERTAS is the only "fourth-generation" OPAC commercially
available that gives intelligent assistance and ranks the result of
a search.


3.2.3 PLEXUS/TOME/MITI


PLEXUS is essentially a Natural Language system, centered around the
use of the facet classification scheme BSO, or Broad System of
Ordering (ALBE90). PLEXUS has been designed since 1983 by Alina
Vickery and Helen Brooks as a prototype expert system for referral
under a British Library R & D Department grant at the Central
Information Service of the University of London. PLEXUS was designed
to help librarians to deal with referral queries about Gardening. It
has a dictionary of terms, a (knowledge) database of 500 records of
referral resources (general gardening reference works, societies and
experts) and a hierarchical classification of the subject domain
"Gardening", by facets using the Broad System of Ordering (BSO).

Input to it is free form and can be Natural Language: a list of
terms, a phrase, or a sentence. The user's input is passed through a
parser to extract the significant words, which are then used to
construct a Boolean query for the database (HAIIK88). The search
employs knowledge of search strategies and tactics to broaden and
narrow the search.

The search strategy is modified automatically if the user wants more
or less information. PLEXUS also constructs a user model: by a
series of questions it assesses the level of experience of the user
(MITE89).

To handle input in Natural Language, a
stopword list and stemming rules were
needed. The stopword list contains about
1400 terms: all articles, prepositions,
conjunctions, pronouns and auxiliary
verbs are removed, as well as many
general words (VICK88) (BROO87).
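The pipeline described above (stopword removal, stemming, Boolean query
construction) can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the stopword set, the suffix
rules, the "*" truncation mark and the function names are all
hypothetical and are not taken from PLEXUS itself.

```python
import re

# Hypothetical, tiny stand-in for the stopword list of about 1400 terms.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "to", "and", "or",
             "i", "my", "is", "are", "can", "do", "how", "what", "which",
             "with", "about", "need", "want", "information"}

# Hypothetical suffix-stripping rules, tried in this order.
SUFFIXES = ("ing", "ies", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def significant_stems(text):
    """Return the stems of all non-stopwords, in order, without repeats."""
    stems = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOPWORDS:
            continue
        s = stem(word)
        if s not in stems:
            stems.append(s)
    return stems


def boolean_query(text):
    """Join the truncated stems with AND (assumed query syntax)."""
    return " AND ".join(s + "*" for s in significant_stems(text))


print(boolean_query("How do I prune the roses in my garden?"))
```

A real system would need far larger word lists and stemming rules that
respect the vocabulary of the subject domain.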

TOME Searcher

TOME Searcher is essentially a Natural
Language system centered around the
INSPEC Thesaurus. In 1987 it was decided
to develop the PLEXUS system outside
the University into an intelligent
interface to online databases on
host computers. This product, TOME
Searcher, was launched in the summer of
1988 by TOME Associates.

TOME Searcher was developed for
professionals in electrical and
electronic engineering, computer science
and information technology (INSPEC).
The system has the functions of
automatic dialup, logon, file selection
and transmission of a search statement
to the host computer, using the
Common Command Language (CCL) of ESA.

Because semantic categories in
electronic engineering etc. are
considerably different from those in the
biological domain, a new semantic
analysis of the terminology was needed.
TOME Searcher develops a first search
strategy and automatically assesses the
probable hit rate before going online,
by consulting a thesaurus which contains
the posting data from the databases
covered (VICK90).

Because  the  terminology  changes,  the 
TOME  Searcher  Thesaurus  has  to  be 
updated. 


MITI is essentially a Natural Language
system, centered around the use of the
faceted ROOT Thesaurus of the British
Standards Institute (VICK90).

The system will initially access four
hosts: STM, ESA, Telesystemes and ECHO,
but will be extendable to any host.




The target domains of the current MITI
will be: General Technology and
Science, Environmental Issues (VICK90).
MITI will develop an Intelligent
Multilingual (English, French, German,
Spanish) Interface which can be
installed on a Personal Computer. It
will enable untrained users to have
access to different databases on a
number of hosts in a uniform way, using
Natural Language (CORD90).

MITI combines the best properties of the
TOME Searcher and the IANI interface
of the Scandinavian network (BOUM90).
In October 1991 the MITI project
appeared to have been withdrawn.


3.2.4. DIANEGUIDE/I'M GUIDE/MIM

The French firm ERLI (Etudes et
Recherche Linguistique et Informatique)
was founded in 1977 and now has some 70
employees working for government and
private companies on Artificial
Intelligence and Retrieval
(Computational Linguistics). In 1990
France Telecom took a shareholding in
GSI-ERLI. GSI-ERLI has developed several
products, such as:

- SAPHIR, a Natural Language Analyser,
of the ATN type and comparable with
INTELLECT, for interrogating factual
databases, which is available for IBM
mainframes working under MVS and
VM/CMS (SQL and DB2). SAPHIR
translates a question formulated in
Natural Language for a relational
database into SQL. The user does not
have to be familiar with either the
structure of the database or SQL.

- MLS, a Natural Language System to
query the 2500 professional headings
of the French Yellow Pages directory,
which is available to 4 million
End-Users with Minitel terminals
(CLEM88).

GSI-ERLI has developed Natural Language
database access software which uses
the firm's proprietary ALEXIS database
management software. ERLI's intelligent
front-end (sometimes referred to as
ALEXDOC or "NLQP", for natural language
query processor) can be adapted to a
variety of retrieval and database
systems, including OPACs. ERLI NLQP for
bibliographic retrieval has been adopted
as the retrieval software for access to
the French online subject authority file
RAMEAU. The RAMEAU file is maintained
by the Bibliothèque Nationale and is
available as a public file to libraries
via SUNIST, the national university
network for STI (HILD89). RAMEAU is
based on the Library of Congress Subject
Headings (LCSH). RAMEAU contains some
100,000 Subject Headings, which are used
in French libraries as the Common
Subject Indexing language (JOUG89).

DIANEGUIDE 

The European Community has given a
contract to GSI-ERLI to develop a
Natural  Language  Interface  for  the 
DIANEGUIDE  database,  which  contains 
information  in  9  languages  about  some 
1500  European  databases.  Information  is 
given  about  : 

-  the  database  (10  fields,  such  as  name 
and  subject) 

- the host (11 fields, such as
retrieval software)

-  the  producer  (9  fields) 

The subject of a database is defined by
some 25 keywords and a summary.

The dictionaries which have been used
were very detailed for finance, culture
and sports, but less detailed for
justice and geography. This could be
solved by manual and intellectual
control of the automatic indexing of the
databases.

In the Natural Language access an
untrained End-User can ask:

What are the databases dealing with
Medicine? The answer is: 301. He can
then choose to 1. See the results, 2.
Narrow, 3. Broaden, or 4. Abandon.

He will probably choose to Narrow by:

1. Language, 2. Database type

When he types 1, the following display
will appear:

1. Language = English 200 hits

2. Language = German 100 hits

3. Language = French 1 hit

But suppose that an End-User wants to
find information about the subject
"Interactive Videodisc". He wants to
type "Interactive Videodisc" and get a
list of databases that have that
information, and preferably the best.
When the subject of the database is
described by keywords such as
Medicine, he might get an answer of zero
to his question.

But the NL Dianeguide has a Matching
Motor (Moteur d'appariement, MA) which
will automatically broaden the query.
First a predicate (AND/OR) tree is
produced for predicates whose values
index at least one record in the
database. The Matching Motor then
searches for all descriptors which are
semantically close to the initial
descriptor (automatic broadening of
the query).

But  when  Interactive  Videodisc  isn't  in 
the  database,  there  is  no  broadening. 
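The broadening behaviour described above can be illustrated with a
short Python sketch. It is not the Matching Motor itself: the index, the
table of semantically close descriptors and the database names are
hypothetical, the predicate tree is left out, and the sketch assumes
that broadening is only tried when the initial descriptor gives no hits.

```python
# Hypothetical index: descriptor -> names of the databases it indexes.
INDEX = {
    "medicine": {"DB-A", "DB-B"},
    "videodisc": {"DB-C"},
    "optical storage": {"DB-C", "DB-D"},
}

# Hypothetical table of semantically close descriptors.
CLOSE_TO = {
    "interactive video": ["videodisc", "optical storage"],
    "medicine": ["health care"],
}


def search(descriptor):
    """Return (databases, descriptors actually used for the search)."""
    descriptor = descriptor.lower()
    direct = INDEX.get(descriptor, set())
    if direct:
        return sorted(direct), [descriptor]
    # No direct hit: broaden to close descriptors that index a record.
    used = [d for d in CLOSE_TO.get(descriptor, []) if d in INDEX]
    found = set()
    for d in used:
        found |= INDEX[d]
    return sorted(found), used


print(search("Interactive Video"))      # broadened via close descriptors
print(search("Interactive Videodisc"))  # unknown term: nothing to broaden
```

The second call shows the limitation noted in the text: a term that is
unknown to the system has no semantic neighbours, so no broadening can
take place.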

I'M  GUIDE 

Information about European CD-ROM
products has been added to the
Dianeguide database, which was then
combined with a Brokersguide. The new
product was called I'M GUIDE. Databases
are classified in 41 categories, such as
Medicine and Education; 77 databases
exist in the category Education.
Although Natural Language access is
available, the I'M GUIDE still has no
answer for the question "Interactive
Videodisc".

MIM

The knowledge that was developed during
the Dianeguide project has been used by
GSI for another IMPACT project, the
Multilingual Interrogation Module (MIM).
MIM has been developed for a database
of 400 pages, "A people's Europe", which
contains 11 chapters with information
about Equal treatment, Europe without
frontiers etc. 40,000 words have been
used for the English version, 60,000 for
the French equivalent.

A multilingual dictionary of 14,000
words has been used: 8,500 UK/FR links,
6,500 UK/IT links and 4,900 FR/IT links.

CURRENT  CONTENTS  BASED  DATABASE 
DIRECTORY  (CCDD) 

Because Natural Language is available
in the titles of journal articles, a
Database Directory might be based on the
Current Contents of some 4000 European
Scientific and Technical journals
(6 languages: Multilingual Contents)
and 2000 from the US.

For each journal name, details could be
given about the databases that index and
abstract articles from that journal. In
that case an untrained End-User can ask
"Interactive Videodisc" or "Intelligent
Information Retrieval", find some
relevant titles of articles and see in
which journals and databases he can find
more.


3.2.5. DGIS/STINET/CONIT

In 1980 Gladys Cotter wrote a report,
"Commercial database searching: a
proposed additional DTIC user service"
(COTT80). This report became a
forerunner of the DGIS/STI Program,
"STINET", which was described in detail
in 1987 (COTT87). STINET is based
on several related development projects
such as the Gateway, the Local
Automation Model, the Directory of
Resources, the Common Command Language,
Post-Processing, an End-User Interface,
an Expert Link and an Electronic
Document System.

A DGIS/STINET bibliography was compiled
in December 1988 (KUHN88).

The end goal of STINET is to bring these
components together into a coherent
and comprehensive whole and allow users
to interact with information retrieval
systems via an expansive Natural
Language Interface.

THE  DGIS  GATEWAY 

The DGIS Gateway is based on the
Technology Information System (TIS),
which has been developed since 1975 by
Victor Hampel at Lawrence Livermore
National Laboratory (LLNL) under the
sponsorship of the Department of Energy.
Its Intelligent Gateway Processor (IGP)
software was conceived in 1975 as a
table-driven interpreter for the
creation of integrated Information
Systems, the "Meta Machine" (HAMP79).
The translation of dissimilar
communication protocols, in addition to
the translation of commands and formats,
is carried out by the IGPs with an
advanced version of the Network Access
Machine (NAM) software, which was
developed by Rosenthal and Lucas for
NBS. The NAM software was integrated in
TIS in 1978 and completely rewritten in
IMS for TIS/IGP use (BURT86).

DGIS itself is a low-level AI-like
system, which operates at DTIC on
integrated BSD UNIX and INGRES based
software, called the IGP Toolbox. After
a prototype DGIS had been running on a
VAX 11/780 using UNIX, DGIS now resides
on a Pyramid 98X minicomputer at DTIC.
In 1986 DGIS was evaluated by a
professional information specialist.
DGIS now connects to 21 database systems
in native mode.

LOCAL  AUTOMATION  MODEL  (LAM) 

In the US there are 500 Defense
libraries with over 10 employees. Some
of them are run by contractors. Because
DoD libraries felt the need to automate
local information collections, DTIC in
1983 initiated the development of a
Library Automation system, responsive
to the local library management and
networking needs of DoD libraries. The
functional description of the Local
Automation Model (LAM) was presented in
October 1983 (HAMia3). Based on this
study a prototype system was specified,
the Integrated Bibliographic Information
System (IBIS). IBIS would encourage
wider participation of libraries in
DTIC's Shared Bibliographic Input
Network (SBIN) (COTT86).

A single command set should be available
for local and central functions.

66 packages were identified, but no
single system provided all the functions
required. Because integrated library
systems supported most of the library
functions and gateway systems supported
external database access, but no package
supported both, it was decided that the
Intelligent Gateway Processor (IGP) of
DGIS should form the gateway part of
IBIS. Six packages were identified for
further selection. In August 1985 the
selected package was integrated with
the gateway software (HART85). As
refinement of the specifications
progressed, DTIC was joined by the
Library of Congress (LC) in its effort
to develop a system suitable for use in
Federal Libraries. In 1986 two versions
of the LAM/IBIS prototype were tested
(COTT87). In September 1988 LC awarded a
contract to SIRSI for the Scientific and
Technical Information Local Automation
System (STILAS). An open contract
between the Library of Congress and
SIRSI enables Federal Government
libraries to purchase STILAS through the
FEDLINK program. STILAS is a turnkey
system, based on a combination of
BRS/Search for retrieval and SIRSI's
popular UNIX-based modular Unicorn
Collection Management System, with
modules for the online public access
catalog, circulation control etc. The
STILAS gateway permits simultaneous
interaction with databases on Dialog or
BRS, while also searching the local
files. The universal access mode is made
possible by the Retrieval Interface
Manager (RIM). Essentially RIM is a
translator, converting STILAS commands
(in a format based upon BRS/Search) into
the formats required for other systems
(NEWT89).
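A command translator of the kind described for RIM can be pictured as a
table lookup: one common verb, one template per target system. The
Python sketch below is purely illustrative. The host names, the common
verbs and the target command templates are hypothetical and do not
reproduce the actual STILAS, BRS/Search or Dialog syntax.

```python
# Hypothetical translation table: common verb -> template per target.
TEMPLATES = {
    "host_a": {"find": "SELECT {arg}", "show": "TYPE {arg}", "stop": "LOGOFF"},
    "host_b": {"find": "..SEARCH {arg}", "show": "..PRINT {arg}", "stop": "..OFF"},
}


def translate(command, target):
    """Translate one common command into the form of the target system."""
    verb, _, arg = command.strip().partition(" ")
    table = TEMPLATES[target]
    key = verb.lower()
    if key not in table:
        raise ValueError("no translation for command: " + verb)
    # Templates without an {arg} placeholder simply ignore the argument.
    return table[key].format(arg=arg.strip())


for host in ("host_a", "host_b"):
    print(host, "->", translate("FIND radar", host))
```

Keeping the translations in a table means that a further host can be
supported by adding one more entry rather than by changing the
translator.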

DIRECTORY  OF  RESOURCES 

A Directory of online databases was
developed, which contains information
on the content and scope of databases
relevant to the interests of DoD. The
Directory makes use of the INGRES
RDBMS, which permits easy programming
and unified use of native mode or
menu-driven mode for searching the
Directory. But INGRES lacks support for
large text fields and full-text
retrieval. The Directory is
subject-searchable, so that on entering
the topic of interest, the user is
provided with a listing of
appropriate databases (KUHN88) (KRUE90).

COMMON  COMMAND  LANGUAGE  (CCL) 

The DGIS Common Command Language (CCL)
is a project to access the multiplicity
of information systems with a
standard command language. Because DGIS
is a UNIX/C based system, the CCL began
in 1986 with UNIX/C programming. Later
on PROLOG was chosen to translate a CCL
command into a command of a target
database. Based on the design goals, the
CCL was structured as a knowledge-based
system and evolved into a Common
Command Language System (CCLS). DGIS CCL
is based on the NISO/ANSI CCL (NISO87).
DGIS CCL will gradually migrate from the
structured language of NISO CCL to
Natural Language. PROLOG will be coupled
with a relational DBMS with an SQL
interface so that it can work with any
RDBMS (KUHN88). The DGIS CCL is
currently limited to single database
access to the major information systems
DROLS/DTIC and NASA/RECON and the 3
database vendors BRS, Dialog and ORBIT
(TRMtes).

END-USER INTERFACE

In late 1984 Telebase started marketing
the Easynet system, which contained
several components that should become
available in DGIS/STINET. Because
Easynet became very popular for End-User
searching, the Easynet method was
integrated in DGIS/STINET as the
Menu-Aided Easy Searching Through
Relevant Options (SearchMAESTRO). It
provides an easy way to search hundreds
of Government and commercial databases
without knowing the individual (native)
search command language for each
database. The prototype SearchMAESTRO
was tested by approximately 30 users.
SearchMAESTRO was offered as a DTIC
operational service in October 1987.
In the fall of 1990 a User Needs
Questionnaire was sent to all users. The
most frequently cited reason for using
SearchMAESTRO was that it eliminates the
need for multiple accounts. The most
useful features were SOS and SCAN. SOS
gives immediate online assistance from
search experts, and SCAN simultaneously
searches several automatically selected
databases in a subject area. It provides
a list of databases with the number of
hits.

In the interest of the more experienced
searchers, a menu-driven CCL interface
was added to SearchMAESTRO in
February 1991. CCL allows for proximity
searching and the building of sets.
Extensive HELP screens and
database-specific documentation are
available while using the CCL feature.
A TOTAL command tells the CCL user how
much he has spent during the search
session. Either ISO or NISO common
commands can be used to search in the
databases of the vendors BRS, Dialog,
Profile, Wilson and VU/TEXT (GREE91).

NL  QUERY  BUILDING  EXPERT  SYSTEM 

Because SearchMAESTRO provides
little in the way of sophisticated
assistance for developing effective and
comprehensive search strategies of the
kind a human expert searcher could be
expected to perform, in 1988 a limited
number of DGIS users got access to the
CONIT Advanced User Assistance on a
Multics mainframe computer at MIT.
CONIT, an acronym for "Connector for
Networked Information Transfer", has
been developed since 1981 by Richard
Marcus and includes a Common Command
Language and a menu-oriented interface
mode (MARC88). CONIT is able to take a
user's Natural Language phrase and to
apply an all-fields keyword-stem
approach to the automatic translation of
the user's search request into a search
strategy for any database and system.
CONIT does not maintain Thesauri, but
uses techniques for automatic phrase
decomposition, common word exclusion and
stemming. These techniques relate the
user's Natural Language topic
expressions to both the free text and
Thesaurus terms in the document's
database records.

Example: "digitized document retrieval"

is broken down into

"digit", "document" and "retriev"

F DIGIT? AND DOCUMENT? AND RETRIEV?
(Dialog)
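The example above can be reproduced with a small Python sketch of the
keyword-stem approach: decompose the phrase, drop common words, stem
each keyword and add a truncation mark. The word list, the suffix rules
and the second target system ("other", with "$" as its mark) are
assumptions made for illustration; only the "?" mark and the resulting
Dialog statement come from the example itself.

```python
# Hypothetical list of common words to exclude.
COMMON_WORDS = {"the", "of", "and", "for", "in", "on", "a", "an", "to"}

# Hypothetical suffix rules, tried in this order.
SUFFIXES = ("ization", "ized", "ing", "al", "s")

# Truncation mark per target system; "other" is a made-up second target.
TRUNCATION = {"dialog": "?", "other": "$"}


def keyword_stem(word):
    """Strip the first matching suffix, keeping at least four letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def search_statement(phrase, system="dialog"):
    """Turn a Natural Language phrase into an ANDed, truncated statement."""
    mark = TRUNCATION[system]
    stems = [keyword_stem(w) for w in phrase.lower().split()
             if w not in COMMON_WORDS]
    return "F " + " AND ".join(s.upper() + mark for s in stems)


print(search_statement("digitized document retrieval"))
```

Because every stem is truncated, the statement matches both free-text
words and Thesaurus terms that begin with the same stem.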

Because the mainframe CONIT does not
make use of modern interface techniques
(e.g. windowing) and its explanations
tend to be wordy, the current
version of CONIT has only limited
possibilities for providing enhanced
service for the DGIS community (MARC88).

The CONIT mainframe version was ported
to a partial implementation of an
"expert" version in UNIX/C in a
minicomputer environment.

Significant results were achieved in the
implementation of the first phase of
an algorithm that automatically ranks
documents according to relevance models.
Also developed was the design and
partial implementation of an automatic
search strategy narrowing selector based
on user feedback or reasons for document
irrelevance (MARC90).

Although these developments are
interesting, much still has to be done
before QBES is operational and before
End-Users have Natural Language access
to the global database world.

POSTPROCESSING 

The DGIS postprocessing utilities have
been based on the bibliographic
postprocessing capabilities of the TIS,
which are described in a paper for the
Online '82 Conference (HAMP82) and in
(BURT85).
These capabilities included: frequency
analysis, merging of files, elimination
of duplicates, cross correlation of
fields and analysis of data field use.
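Three of the capabilities listed above (merging of files, elimination
of duplicates and frequency analysis) can be sketched as follows. This
is a hedged illustration, not the TIS or DGIS code: the record layout,
the field names and the duplicate rule (same normalised title and year)
are assumptions.

```python
from collections import Counter


def merge_and_deduplicate(*result_sets):
    """Merge result sets, keeping the first copy of each duplicate."""
    seen = set()
    merged = []
    for records in result_sets:
        for record in records:
            # Assumed duplicate rule: same normalised title and same year.
            key = (record["title"].strip().lower(), record["year"])
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


def field_frequency(records, field):
    """Count how often each value of a field occurs, most frequent first."""
    values = (record[field] for record in records if field in record)
    return Counter(values).most_common()


set_a = [{"title": "Expert systems ", "year": 1988, "source": "DB-A"},
         {"title": "Gateways", "year": 1986, "source": "DB-A"}]
set_b = [{"title": "expert systems", "year": 1988, "source": "DB-B"}]

merged = merge_and_deduplicate(set_a, set_b)
print(len(merged), field_frequency(set_a + set_b, "year"))
```

Real bibliographic records rarely match this cleanly, so a production
duplicate check would need more tolerant matching than the exact key
used here.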

DTIC now also wants to reach the
End-User community, engineers with
powerful workstations and spreadsheet
software. Because the needs of these
End-Users are different from the needs
of bibliographic database searchers,
DTIC has held a user-needs survey. In a
paper at this AGARD-TIP meeting Buddy
Haller reports on this user-needs
survey.

Summary  : 

Natural Language access is not yet
available, but the users of
SearchMAESTRO and STILAS may be quite
pleased with their systems
and may not need NL access.

3.2.6. SPIRIT/EMIR/DIALECT

SPIRIT is a software package for
full-text retrieval with reformulation.
It has been operational since 1981 and
has become popular since 1988. It is
based on Natural Language Processing R&D
by Christian Fluhr of INSTN and it is
sold by SYSTEX. SPIRIT combines modular
linguistic processing with statistical
processing and accepts Natural Language
queries. Text segments that are obtained
as answers are ranked in descending
order of relevance. Even a portion of
a text can be used as a query.

In this case SPIRIT calculates the
degree of semantic proximity between the
query text and all other texts in the
database and ranks selected documents
in descending relevance order. The
semantic proximity is calculated by
using weights produced by a statistical
model (FLUH89).
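The idea of ranking texts by a statistically weighted proximity to a
query text can be illustrated in Python. The sketch below swaps in a
generic technique, TF-IDF weights with cosine similarity, because the
statistical model used by SPIRIT is not described here; it omits all
linguistic processing, and the sample texts and function names are
hypothetical.

```python
import math
from collections import Counter


def tokens(text):
    return text.lower().split()


def rank(query, documents):
    """Return (document index, score) pairs, most similar first."""
    doc_tokens = [tokens(d) for d in documents]
    n = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in doc_tokens for term in set(toks))

    def weights(toks):
        # Smoothed inverse document frequency: rarer terms weigh more.
        tf = Counter(toks)
        return {t: c * (math.log((n + 1) / (df[t] + 1)) + 1.0)
                for t, c in tf.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q = weights(tokens(query))
    scores = [(i, cosine(q, weights(toks)))
              for i, toks in enumerate(doc_tokens)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = ["nuclear plant building standards",
        "gardening reference works",
        "patent summaries for nuclear plants"]
ranking = rank("building standards for nuclear plant", docs)
print([index for index, _ in ranking])
```

Without stemming, "plant" and "plants" are treated as different terms
here, which is one reason a system of this kind combines statistical
weighting with linguistic processing.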

EMIR 

European Multilingual Information
Retrieval (EMIR) is a CEC ESPRIT project
of Fluhr and SYSTEX that will complete
a feasibility study on automatic
indexing of free text and multilingual
querying of textual databases. It is
based on SPIRIT and will use textual
data concerning the building standards
of nuclear plants or patent summaries.


DIALECT 

DIALECT is an expert assistant for
Information Retrieval that was developed
by Bassano (BASS86). Fluhr and Bassano
are discussing the possibility of
combining EMIR and DIALECT.

By combining EMIR and DIALECT a powerful
multilingual Natural Language system can
be developed.


3.2.7  ALPHA  DIDO 

Hutton + Rustron Data Exchange Ltd, the
publisher of the UK Defence Equipment
Catalogue, is lead contractor and
project  manager  of  a  two-year  CEC  IMPACT 
demonstration  project  for  developing  an 
online  information  service  for  the 
Construction  Industry,  which  allows  for 
Multilingual  enquiries  with  particular 
relevance  to  the  use  of  Standards. 

The  consortium  includes  The  British 
Standards  Institution,  which  will 
provide  machine-readable  data. 

The  project  involves  the  use  of  an 
intelligent  interface  and  domain 
knowledge  models  on  the  care  of  historic 
buildings,  demolitions  etc.  The  project 
is  using  Distributed  Intelligence  Data 
Operation  (DIDO)  as  the  method  of 
operation. 

The ENQ module is an interpretative
intelligent interface with Natural
Language features, operating on a
reference engine (SYS). The SYS module
embodies a concatenation (merging) of
existing Thesauri such as:

BSI ROOT, TIT and ECCTIS.

SYS uses an interlexical system based
on concept codes which operate
multilingually.

An Indexer Thesaurus should not be used
by casual Users and Novice End-Users.


4. NATURAL LANGUAGE AND THESAURUS AIDS

Vocabulary  control  is  needed  to 
alleviate  the  matching  problems  caused 
by  the  use  of  Natural  Language  in 
retrieval  system  queries  and  indexing. 

The problems are of 3 types:
morphological, syntactical and semantic.
Thesaurus aids can be used to resolve
semantic problems: control of homonyms,
synonyms and other term equivalences,
and identification and classification
of related, broader, and narrower terms.
Most Thesauri incorporate both
hierarchical and equivalence-type
relationships. Users searching under a
given term may need to find material
indexed under an equivalent term,
related terms, or narrower, more precise
terms (HILD89).

A Thesaurus normally has the relations
shown in Figure 4 (AITC90) (H1LS90).

NATURAL  LANGUAGE  LEAD-IN  TERMINOLOGY 

A Thesaurus not only has synonyms but
can also have Lead-in terminology. In
that case a preferred term should be
used instead of a non-preferred
term (the lead-in term). The
non-preferred lead-in term is a Natural
Language term which leads to the
preferred Descriptor. This is done by
USE and USE ... AND relationships.

MUNITIONS 
USE  Ammunition 

ARMOR  PIERCING  PROJECTILES 
USE  Armor  Piercing  Ammunition 
AND  Projectiles 

The  USE  . . .  AND  relationship  will  also 
help  the  End-User  to  understand  that 
he  has  to  combine  descriptors  using  the 
Boolean  operator  AND. 
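The USE and USE ... AND relationships above amount to a lookup from a
lead-in term to one or more preferred Descriptors, which are then
combined with AND. A minimal Python sketch follows; the two entries are
the examples given above, while the function names and the quoting of
Descriptors in the query are assumptions.

```python
# Lead-in term -> preferred Descriptor(s). One Descriptor models USE,
# several Descriptors model USE ... AND.
LEAD_IN = {
    "MUNITIONS": ["Ammunition"],
    "ARMOR PIERCING PROJECTILES": ["Armor Piercing Ammunition",
                                   "Projectiles"],
}


def to_descriptors(term):
    """Replace a lead-in term by its Descriptors; keep other terms."""
    cleaned = term.strip()
    return LEAD_IN.get(cleaned.upper(), [cleaned])


def to_query(term):
    """Combine the Descriptors with the Boolean operator AND."""
    return " AND ".join('"' + d + '"' for d in to_descriptors(term))


print(to_query("armor piercing projectiles"))
print(to_query("Munitions"))
```

A term that is not a lead-in term is passed through unchanged, on the
assumption that it is already a Descriptor.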

The more Lead-In terms a Thesaurus has,
the lower the number of Descriptors
can be. The more Natural Language
Lead-In terms a Thesaurus has, the
easier the Thesaurus will be for
End-Users. The Thesaurus then becomes
an End-User Thesaurus (BATE86) or User
Thesaurus (BATE89).

A  Thesaurus  without  Natural  Language 
Lead-In  terminology  is  an  Indexer 
Thesaurus,  which  can  be  used  by  trained 
Information  Specialists. 


TOPICS 

Instead of a Thesaurus, Reid suggests
using Topics, faceted semantic networks
(or clusters of some 50 terms) (REID91).
Generally, Topics can be compared with
the Subject Categories that exist in
several Thesauri such as INSPEC etc. The
Topics are based on RUBRIC concept
trees (TONG85). RUBRIC (RUle-Based
Retrieval of Information) uses a set of
rules to make a heuristic decision tree
containing word patterns. RUBRIC is able
to weight terms according to their
importance, as specified by the users.

A  unique  feature  "modifier  rules"  gives 
RUBRIC  synonym  knowledge  and  helps  to 
distinguish  among  multiple  meanings  of 
a  term  (HAWK88).  RUBRIC  combines 
knowledge-intensive  techniques  with 
efficient  full-text  retrieval  and 
ranking  strategies. 

The RUBRIC approach seems particularly
suited to users who are prepared to
spend a lot of effort in constructing
queries and would not be appropriate in
a general environment (BELK87).

Expert  users  can  build  up  libraries  of 
retrieval  topics,  and  Novice  users  can 
use  them  as  building  blocks  to  easily 
compose  powerful  queries  (table  V). 

Acquiring the concepts and their
qualitative and quantitative
relationships requires large amounts of
effort. That is why the CONSTRUCTOR
algorithm was developed by Tong to
automatically generate relationships
between concepts (building probabilistic
networks from data). The CONSTRUCTOR
system generates sparse networks (i.e.
with few arcs) and can find subtle
relationships that would take much
effort to find manually. But there is
still much work needed by a user to
identify which Concepts are
present (TONG90).

But if two people or groups of people
construct a Thesaurus in a given area,
only 60 % of the index terms may be
common to both Thesauri.

And if two scientists or engineers are
asked to judge the relevance of a given
set of documents to a given question,
the area of agreement may not exceed
60 % (CLEV84).
INFORMATION TREE


HIERARCHIC THESAURUS


TERRORISM
    ASSASSINATION
        ASSASSINATE
    ENCOUNTER
        ATTACK
        BATTLE
        FIGHT
        SKIRMISH
        UPRISING
    EXPLOSION
        BLAST
        EXPLODE
        EXPLOSION
    KIDNAPPING
        ABDUCT
        HIJACK
        KIDNAP
        KIDNAPPING
        RANSOM
    NAMED ACTOR
        BASQUE
        ETA
        IRA
        PLO
        RED ARMY
        RED BRIGADE


CIRCULAR THESAURUS


            FIGHT

    BATTLE          SKIRMISH


Figure 3
A Thesaurus normally has the following relations (AITC90):


                TT - TOP TERM
                     |
                BT - BROADER TERM
                     |
  SYNONYM -- DE - DESCRIPTOR -- RT - RELATED TERM
                     |
                NT - NARROWER TERM

  LEAD-IN TERMINOLOGY

  SN - Scope Note (Definition)
  SC - Subject Category (50 terms)

[Diagram: relation structures (TT, BT, DE, NT, RT, SYN, U, SN, SC) of
the Thesaurus (TEST), DRIT, DTIC Subject and NATO Thesaurus, with an
illustration of polyhierarchy (1 DE with 2 BT and 2 SC) using a
GEOGRAPHY example (Europe, Netherlands, Holland).]


Figure 4
Table V


NATO THESAURUS SUBJECT CATEGORIES

COSATI CODES

FIELD  GROUPS                           SUBJECT HEADING

01     A B C12 D E F                    AVIATION TECHNOLOGY
02     A B C D E F                      AGRICULTURE
03     A B C                            ASTRONOMY AND ASTROPHYSICS
04     A B                              ATMOSPHERIC SCIENCES
05     A B C D E F G H I                BEHAVIORAL AND SOCIAL SCI
06     A B C D E F G H I J K L M N O    BIOLOGICAL AND MEDICAL SCIENCE
07     A B C D E F                      CHEMISTRY
08     A B C D E F G H I J K L          EARTH SCIENCES AND OCEANOGRAPHY
09     A B C D E F G                    ELECTROTECHN AND FLUIDICS
10     A B C D                          POWER PRODUCTION AND ENERGY
11     A B C D E F2 G H I J K L         MATERIALS
12     A B C D E F G H I                MATHEMATICS AND COMPUTER
13     A B C D E F1 G H I J1 K L M      MECHANICAL, INDUSTRIAL, CIVIL AND
                                        MARINE ENGIN
14     A B C D E                        TEST EQUIPMENT, RESEARCH
15     A B C3 D E F7                    MILITARY SCIENCES
16     A B1 C D3 E                      GUIDED MISSILE TECHNOLOGY
17     A B C D4 E2 F G4 H I J K         NAVIGATION DETECTION
18     A B C D E2 F G H I J             NUCLEAR SCIENCE AND TECHNOLOGY
19     A1 B C D E F G H1 I J K L M      ORDNANCE
20     A B C D E F1 G H I J K L M N O   PHYSICS
21     A B C D E F G H2 I2              PROPULSION ENGINES FUELS
22     A B C D E                        SPACE TECHNOLOGY
23     A B C D E F                      BIOTECHNOLOGY
24     A B C D E F G                    ENVIRONMENTAL POLLUTION
25     A B C D E                        COMMUNICATIONS


COSATI STRUCTURE

SUBJECT CATEGORIES

SUBJECT CODES       SUBJECT HEADINGS
(FIELDS, GROUPS)    (FIELDS, GROUPS)

01                  AVIATION TECHNOLOGY
01 A                    AERODYNAMICS
01 B                    MILITARY AIRCRAFT OPERATIONS
01 C                    AIRCRAFT
01 C A                      HELICOPTERS
01 C B                      BOMBERS
01 C C                      ATTACK AND FIGHTER AIRCRAFT
01 C D                      PATROL AND RECONNAISSANCE AIRCRAFT

01 C L                      RESEARCH AND EXPERIMENT AIRCRAFT

200  SUBJECT  CATEGORIES 


TEST  (1967) 

NATO  PCOOATABASE 
Wn  DATABASE 


225  SUBJECT  CATEGORIES  SIGLE  DATABASE 


250 SUBJECT CATEGORIES        DTIC
OF 50 DESCRIPTORS EACH        NATO
: 12,500 DESCRIPTORS


250 TOPICS
OF 50 KEYWORDS

250 BAGS
OF 50 MARBLES


350 SUBJECT CATEGORIES NTIS (SRIM)

Topics, Subject Categories or Thesauri
Indexing and Searching can be difficult.

KNOWLEDGE  GRAPHS 

Knowledge graphs can represent many
semantic structures found in Natural
Language. Not only the structure of
words can be represented, but even the
structure of sentences (HOED89).
Integrated knowledge graphs can be used
as an expert system. The use of
knowledge graphs may help a knowledge
engineer to build an expert system by
extracting knowledge from texts written
by experts without having to consult
them in person.

Knowledge graphs have been used for a
medical diagnosis system, MEDES, and for
describing the sulphur cycle in the Ems
estuary (JAME90).

INDEX EXPRESSIONS

Index expressions have been used to show
relations between concepts (BRUZ90,91).

4.1  MULTILINGUAL  THESAURI 


An Expert User can normally work with
an Indexer Thesaurus without Lead-In
Terminology, but sometimes even an
Expert User needs help. That is the case
when a US Expert User or a European
End-User wants to search in European
databases. Because of the language
barrier they cannot choose good
Descriptors for searching in French or
German databases. That is why they need
a multilingual Thesaurus to search in
European databases.

Multilingual Thesauri are available for
multilingual databases such as IRRD and
EUDISED.

The French PASCAL database has a
Multilingual Controlled Vocabulary
called PASCAL LEXIQUE, which contains
80,000 controlled keywords in French,
English and Spanish. French keywords
have been used from the start of the
database, but English keywords have only
been used since 1984.

UNESCO maintains a multilingual
multidisciplinary SPINES Thesaurus which
is being used for the SESAME database
of CEC DG XVII on the Eurobases host.


The EURODICAUTOM database on the ECHO
host computer of CEC DG XIII contains the
multidisciplinary multilingual
terminology that is being used by CEC
for automatic SYSTRAN translation.

The EUDISED Thesaurus is available in
9 European languages and is being used
for the EUDISED database on the ESA
host computer. The EUDISED Thesaurus has
some 3700 descriptors, with a graphical
display. 2500 of the descriptors might
be comparable or synonymous with 2500
descriptors from the ERIC Thesaurus and
other Education-related Thesauri. If
these Thesauri were integrated or
linked, the EUDISED Thesaurus might
become a multilingual Interface to
ERIC and Education-related databases.

[Figure: EUDISED 2500 === 2500 ERIC. Other labels in the figure:
EUDISED 25, French, 250, English, + 1200, 25000]

4.2  BILINGUAL  NATO  THESAURUS 

In 1987 the NATO Standardization Group
(NSG) felt the need for a NATO
Standardization Information Base (NSIB),
which should give information about
ongoing Standardization activities. The
NSG wanted to standardize the
terminology by using a bilingual
English/French Thesaurus. An AC/315 NSG
Ad-Hoc Working Group advised in 1988 to
use the DTIC Retrieval and Indexing
Terminology (DRIT) as the baseline of the
NATO Thesaurus (COTT89). In 1989 a
Thesaurus Steering Group was formed.
This group decided that the NATO
Thesaurus should be a combination of the
DRIT and the DTIC Subject Categorization
Guide, which is used to
distribute microfiches of reports
(KRUE90).

By combining the DRIT with the DTIC
Subject Categories the structure of the
NATO Thesaurus became similar to the
structure of the TEST Thesaurus.

The NATO/DTIC Subject Categories are
related to the SRIM Subject Categories
of the NTIS database and to the Subject
Categories of the NATO-PCO, SIGLE and
WTI databases, which are also based on
the COSATI Subject Categories of the
TEST Thesaurus (TEST67).
Because the NATO Thesaurus allows for
Lead-In terms, Natural Language
Lead-In terms can be integrated in the
NATO Thesaurus.

Because the NATO Thesaurus allows for
Scope Notes for Subject Categories and
Descriptors, the Scope Note facility can
be used to integrate Natural Language
Definitions of Descriptors or Subject
Categories in the NATO Thesaurus.
Definitions might be integrated at the
Scope/Metadata level which exists in the
NASA Thesaurus : 2500 terms (NASA89).

In this case the NATO Thesaurus will
become a real End-User Thesaurus.
Unfortunately the AVOCON Thesaurus
module which was used did not support
the USE ... AND ... operator, so all
DRIT USE ... AND ... terms had to be
skipped. Generally, when software for
Information Retrieval is chosen, the
relationships supported by the Thesaurus
module should be investigated.

An overview of possible relationships
in Thesaurus software has been given by
Milstead (MILS90).

NATO is investigating the possibility
of introducing a USE ... + Scope Note
instead of USE ... AND ...

Because AVOCON only supports a
classification code of 4 digits, the
DTIC Subject Code of 6 digits was
transformed into an easy-to-remember
4-digit Subject Code :
15.01.02 became 15AB.
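
The code transformation can be illustrated with a short Python sketch. The
text gives only the single example 15.01.02 -> 15AB; the rule used here
(keep the field number, replace each following two-digit level by the letter
with that position in the alphabet) is an assumption generalised from that
example, and the function name is hypothetical.

```python
import string


def short_subject_code(code: str) -> str:
    """Turn a dotted numeric code such as '15.01.02' into '15AB'.

    Assumed rule (inferred from one example in the text): the field number
    is kept, and each later level n becomes the n-th letter of the alphabet.
    """
    field, *levels = code.split(".")
    letters = ""
    for level in levels:
        n = int(level)
        if not 1 <= n <= 26:
            raise ValueError(f"cannot map level {level!r} to a letter")
        letters += string.ascii_uppercase[n - 1]
    return field + letters


print(short_subject_code("15.01.02"))
```

A code with only one level after the field, such as 01.03, would come out as
01C under the same assumed rule.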

AMENDMENTS 

Related terms were not available in
DRIT, but should be available in the
NATO Thesaurus. An experiment was made
to extract Related terms from the NASA
Thesaurus. NATO Terminology from Subject
Category 01 Aerospace was transformed
into NASA Descriptors using the NASA/DoD
Switching language. The Related terms
of these NASA descriptors were
integrated and transformed back to NATO
descriptors. The result was printed but
could not be used because the NASA
Thesaurus contains too many Related
terms.

The NATO Thesaurus was passed through
an English spelling checker, which
identified Latin and American
terminology.


The American term has become a
Lead-In term. European terminology can
also be integrated as a Lead-In term.

US:

ARMOR
USE Armour

NATO:

INSENSITIVE MUNITIONS
USE Insensitive explosives

EUROPE:

LOW VULNERABILITY AMMUNITION
USE Insensitive explosives
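
As an illustration of how Lead-In terms work at search time, the following
Python sketch resolves a user's term to its preferred Descriptor through a
USE table. The two entries are taken from the examples above; the dictionary
layout and the function name are assumptions made for this sketch and say
nothing about how the AVOCON module stores its data.

```python
# Lead-In term -> preferred Descriptor (the "USE" reference).
LEAD_IN = {
    "ARMOR": "Armour",
    "LOW VULNERABILITY AMMUNITION": "Insensitive explosives",
}


def resolve(term: str) -> str:
    """Return the Descriptor for a Lead-In term, or the term itself."""
    key = " ".join(term.upper().split())  # normalise case and spacing
    return LEAD_IN.get(key, term)


print(resolve("armor"))
print(resolve("Low vulnerability ammunition"))
```

A term that is not a Lead-In term is returned unchanged, so the same call can
be applied to every word or phrase the End-User types.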

LINKING  AND  COMPATIBILITY 

The system manager of the NATO Thesaurus
has compiled a list of the number of
terms for each Subject Category, which
has been combined with data about the
number of AD-A reports in the NTIS
Ondisc CD-ROM for 1990. This list shows
for which Subject Categories new
terminology should be added.

By linking or merging terminology from
other Thesauri the compatibility with
external databases can be promoted.
Integrating and linking of multiple
Thesauri is advocated by Alberico and
Micco (ALBE90). Techniques for doing so
are discussed by Mandel (MAND87).

In  some  cases  a  descriptor  from  an 
external  database  might  become  a  NATO 
Lead-In  term. 

TRANSLATION 

The translation of the NATO Thesaurus
from English into French in 1990 has
been a complicated process, because of
the many partners that were involved.
Tapes in different formats with
different run-dates have given problems.
But the problems have been solved and
links between English and French
descriptors have been introduced. A
toggle switch (function key) has been
developed for online switching between
English and French (online translation).

PRINTED EDITION

The edit copy of the English version of
the NATO Thesaurus is available, as well
as the English NATO Thesaurus on diskette.
Printed edition : January 92.
ANNEX


NL SEARCH STRATEGY :

INTELLIGENT INFORMATION RETRIEVAL

When online searching is done by
intermediaries, the requester has to
explain what he needs to the
intermediary. During the interview the
requester begins to understand and
describe the problem better. When there
is no interview the requester has to
fill in a form. The requester is asked
to explain the purpose of the request,
and generally he is asked to mention a
"good article" as an example.

When an untrained End-User wants to find
information about a certain subject he
sometimes does not know exactly what he
wants, and then he should also start with
a good article as an example.

Because the title of an article, a
report or a book is written in Natural
Language, a search strategy in Natural
Language can generally be based on (a
part of) a title. So an untrained
End-User can also search online in
external databases, starting with
Natural Language, without parsing (NLP).

Because this TIP specialist meeting is
related to Artificial Intelligence,
Intelligent Gateways and Information
Retrieval, a good example of Natural
Language is :

"Intelligent Information Retrieval".

There are several articles and papers
with this title or with a title which
contains this multiword subject. Maybe
our untrained End-User has seen an
article about IIR and wants to know more
about it.

But : Intelligent Information Retrieval
is not a Descriptor.

Which  databases 

Dialindex can be used to see which
databases contain information about IIR.


Options are :

?sf all = 352 files
?sf all social = 42 files
?sf all infosci = 7 files only

s intelligent information retrieval
s intelligent(w)information(w)retrieval
s intelligent AND information AND retrieval
s intelligent AND information(w)retrieval
s intelligent(1N)information(1N)retrieval

The result is shown in table

The 18 hits in INSPEC have all been
indexed with the identifier IIR,
but 60 % of the 317 I+I+R hits in
Information Science Abs are not
relevant.

The untrained End-User will start a
Natural Language I I R search in INSPEC
(b2) and print 10 records. He will
probably be happy with 5 of them.

The Subject-Expert will start a
cluster search IwIwR in the databases
INSPEC, COMPENDEX, ISA, LISA, ERIC, and
NTIS (b 1, 2, 6, 8, 61, 202)
and find some 140 records. After online
removal of duplicates with (rd)
he will print 100 records.

The Information Expert will build a
strategy :

concept I : Information Retrieval

concept II : Artificial Intelligence,
which contains Expert Systems, Knowledge
Based Systems, Natural Language and the
word Intelligent.

He will probably search in some 15
databases on Dialog and maybe several
databases on other host computers.

Costs 

End-User  $  10-20 

Subject-Expert  $  100-200 

Information  Specialist  $  1000-2000 




ESA

Which databases?

Questindex Topic : Information Science

              I I R    IwIwR    I+IwR
                                (sometimes with f)
INSPEC          36       36      384
Pascal 205      11       11       30
Pascal 204      14
ABI              9        9       79
NTIS             4        4       84
NASA             2        2
Costs :       $ 0,3    $ 0,6    $ 0,6

Not in Topic Information Science :
COMPENDEX 225


Which keywords?

HYPERLINE shows which keywords are
useful. The "get" command can be used
to transfer the keywords to search.

NTIS

HL intelligent information retrieval

information retrieval        IR
expert systems               ES
artificial intelligence      AI
databases
user interfaces

A good strategy would be

(IR) AND (ES OR AI)

INSPEC

HL intelligent information retrieval

information retrieval            IR
information retrieval systems
expert systems                   ES
user interfaces
knowledge based systems          KBS

Inspec is more specific than NTIS.

A good strategy would be

(IR) AND (ES OR KBS)

HYPERLINE allows the End-User to start
a search in Natural Language (IIR) and
then find the relevant descriptors.

The End-User will start a very simple
search in

                                   Hits  Prints
NTIS   : IR AND ES AND AI            23      10
INSPEC : IR AND ES AND AI            48      10
Total  :                                     20
Costs  :

The Subject-Expert starts a search in

                                                      Hits  Prints
NTIS       : IR AND (ES OR AI)                         247      50
INSPEC     : IR AND (ES OR KBS)                        368
           : IR AND Intelligent (I+IwR)                384
           : IR AND (AI OR ES OR KBS)                  471
           : IR AND (AI OR ES OR KBS OR Intelligent)   601      50
ABI/INFORM : (I+IwR)                                    79      10
COMPENDEX  : (I+IwR)                                   219      10
LISA       : (I+IwR)                                   289      10
Total      :                                                   130
Costs      :

The documents that are found by the
Subject-Expert will be quite new.

This is a good result for End-User and
Subject-Specialist.

The Information Specialist will do it
better and will make a ZOOM (table 8).

INSPEC  also  mentions  Easynet  (10), 
PLEXUS  (5),  RUBRIC  (4),  SAFIR  (4), 
CODER  (3),  CONIT  (3),  EURISKO  (2),  KISIR 
(2),  NORDINFO  (2),  SPIRIT  (2) 

When he has not found enough Descriptors
he can navigate in the INSPEC Thesaurus

hi Artificial Intelligence : 11861

 3816  Knowledge Engineering
17128  Expert Systems
 4718  Learning Systems
 3446  Natural Languages
 7381  Neural Nets

hi Natural Languages : 3446

11861  Artificial Intelligence
 2062  Computational Linguistics
 9836  User Interfaces
15307  Database Management Systems
 3109  Information Retrieval

hi Expert Systems : 17128

11861  Artificial Intelligence
 4209  Decision Support Systems
21738  Explanation
 1638  Knowledge Acquisition
 4903  Knowledge Based Systems

hi Knowledge Based Systems : 4903

17128  Expert Systems
 3037  Knowledge Representation

The Information Specialist will build
a search profile

Concept 1 : Information Retrieval

Concept 2 : Artificial Intelligence
            Computational Linguistics
            Expert Systems
            Knowledge Based Systems
            Natural Language
            Intelligent

After the search profile has been saved,
it will be executed using Clustersearch
or Onesearch in all relevant databases
on several host computers. Duplicates
will be deleted online, and then some
1000 records will be printed.
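
A search profile of this kind (terms within a concept combined with OR,
concepts combined with AND) can be turned into a search statement
mechanically. The Python sketch below is only an illustration: the function
names are hypothetical, and the output imitates the `s ...` and `(w)`
notation of the Dialog examples in this paper without any claim that a given
host accepts exactly this string.

```python
def adjacency(term: str) -> str:
    """Join the words of a multiword term with the (w) adjacency operator."""
    return "(w)".join(term.lower().split())


def build_query(concepts: list[list[str]]) -> str:
    """OR the terms inside each concept, then AND the concepts together."""
    blocks = [
        "(" + " OR ".join(adjacency(term) for term in terms) + ")"
        for terms in concepts
    ]
    return "s " + " AND ".join(blocks)


profile = [
    ["Information Retrieval"],
    ["Artificial Intelligence", "Computational Linguistics",
     "Expert Systems", "Knowledge Based Systems",
     "Natural Language", "Intelligent"],
]
print(build_query(profile))
```

Keeping the profile as data rather than as a typed command makes it easy to
rerun the same concepts on another host with a different operator syntax.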

The Information Specialist will also
make a ZOOM on Authors in ESA (table 9),
or a ZOOM on Journal names.

He can enter the names of these
authors in the Pascal databases,
then combine all relevant authors
in a set of over 100 hits and then make
a ZOOM on descriptors, to find relevant
French Descriptors.

He can also search in Pascal 204 and 205
with Search Strategy

Concept 1 : Information Retrieval

Concept 2 : Artificial Intelligence
            Knowledge Based Systems
            Expert Systems

90 hits


90 hits : ZOOM

90  Recherche Information
89  Information Retrieval
82  Intelligence Artificielle
81  Artificial Intelligence
54  Recuperation Informacion
51  Inteligencia Artificial
18  Information System
16  Base Donnee
16  Representation Connaissances
15  Knowledge representation
15  Langage Naturel
15  Representacion Conocimientos
14  Natural Language
14  Sistema Informacion
13  Sistema Experto
13  Systeme Expert
12  Expert System

After entering these Descriptors he can
build a French language Search strategy:

Concept 1 :

Recherche Information

Concept 2 :

Intelligence Artificielle
Representation Connaissances
Systeme Expert
Base Connaissance
Langage Naturel

Which results in 183 hits, twice as many
as with the English language Search
Strategy.


?sf all = 352 files
?sf allsocial = 42 files
?sf all infosci = 7 files

Search statement                               Label

s intelligent information retrieval            I I R (NATURAL LANGUAGE)
s intelligent(w)information(w)retrieval        IwIwR
s intelligent AND information AND retrieval    I+I+R
s intelligent AND information(w)retrieval      I+IwR
s intelligent(1N)information(1N)retrieval      ININR

Column headings : I I R, IwIwR, I+I+R, ININR, I+IwR
File selections : all infosci, all social, all, all

Hits per database, in the order printed:

INSPEC                  18   60   385
LISA                    0   15   295   20   289
COMPENDEX               23   219
Trade & Indust ASAP     158
SCISEARCH               141
Information Sci Ab      0   23   317   95
NTIS                    0   6   5   85
ABI/INFORM              79
ERIC                    0   12   58   50
Pascal                  34

Costs                   $ 0,25   $ 1,82   $ 1,82   $ 5,50

TABLE 2 ZOOM

Column headings : NTIS, ABI/COMPENDEX, INSPEC
                  247 hits, 304 hits, 601 hits

Occurrences per descriptor, in the order printed:

Information Retrieval Systems     199   130   249
Information Retrieval             87   761   201
Artificial Intelligence           97   200   68
Expert Systems                    56   376
Expert System                     50
User Interfaces                   165
User Interface                    11   46
Natural Languages                 53
Hypermedia                        23
Natural Language Processing       3   19
Online Searching                  39   22   28
Computational Linguistics         20
Fuzzy Set Theory                  19   15
Knowledge Bases                   6   24   12
Heuristic Methods
Intelligent Gateway               5
Intelligent Information Retr      4   10   12
Knowledge Based Systems           4   102
Knowledge-Based Systems           4   12
Knowledge Representation          44
Non Boolean Search Methods in Information Retrieval

C. J. van Rijsbergen
University of Glasgow
Department of Computing Science
Glasgow, G12 8QQ


Introduction 

Information retrieval is a wide, often
loosely-defined term but in these pages I
shall be concerned only with automatic
information retrieval systems. Automatic as
opposed to manual and information as opposed
to data or fact. Unfortunately the word
information can be very misleading. In the
context of information retrieval (IR),
information, in the technical meaning given
in Shannon's theory of communication, is not
readily measured. Nevertheless it has become
apparent that there is a notion of
information fundamental to the information
retrieval process that underlies our
intuitions about what it is we attempt to
retrieve (see Van Rijsbergen, 1989). In many
cases one can adequately describe this kind
of retrieval by simply substituting
'document' for 'information', where a
document may be text, image etc. This
implies that the process is concerned with
the identification of certain kinds of
objects, viz. documents. It does not explain
what is the basis of this identification
process, and here it is that the notion of
information plays a role. In the case of
Boolean retrieval, an attempt is made to
measure the extent to which the information
is contained in a particular document by
establishing whether a document satisfies the
request. In the non-Boolean case, this
process of satisfaction has a measure of
uncertainty attached to it. For example, one
may wish to express this uncertainty through
a process of plausible inference.

All search strategies are based on comparison
between the query and the stored documents.
Sometimes this comparison is only achieved
indirectly when the query is compared with
clusters (or more precisely with the profiles
representing the clusters). Or indeed,
sometimes the comparison is based on a
comparison with documents within a context
or neighbourhood of a given document.
Frequently the comparison is iterative in that
a user provides feedback after a first
comparison which will affect the next
comparison.


The distinctions made between different
kinds of search strategies can sometimes be
understood by looking at the query language,
that is, the language in which the information
need is expressed. The nature of the query
language often dictates the nature of the
search strategy. For example, a query
language which allows search statements to
be expressed in terms of logical combinations
of keywords normally dictates a Boolean
search. This is a search which achieves its
results by logical (rather than numerical)
comparisons of the query with the
documents.


Boolean  search 

A Boolean search strategy retrieves those
documents which are 'true' for the query.
This formulation only makes sense if the
queries are expressed in terms of index terms
(or keywords) and combined by the usual
logical connectives AND, OR, and NOT.
For example, if the query
$Q = (K_1\ \mathrm{AND}\ K_2)\ \mathrm{OR}\ (K_3\ \mathrm{AND}\ (\mathrm{NOT}\ K_4))$
then the Boolean search will retrieve all
documents indexed by $K_1$ and $K_2$, as well
as all documents indexed by $K_3$ which are
not indexed by $K_4$.

An obvious way to implement the Boolean
search is through the inverted file. We store
a list for each keyword in the vocabulary, and
in each list put the addresses (or numbers) of
the documents containing that particular
word. To satisfy a query we now perform
the set operations, corresponding to the
logical connectives, on the $K_i$-lists. For
example, if

$K_1$-list : $D_1, D_2, D_3, D_4$
$K_2$-list : $D_1, D_2$
$K_3$-list : $D_1, D_2, D_3$
$K_4$-list : $D_1$

and $Q = (K_1\ \mathrm{AND}\ K_2)\ \mathrm{OR}\ (K_3\ \mathrm{AND}\ (\mathrm{NOT}\ K_4))$

then to satisfy the ($K_1$ AND $K_2$) part we
intersect the $K_1$ and $K_2$ lists, to satisfy
the ($K_3$ AND (NOT $K_4$)) part we subtract the
$K_4$ list from the $K_3$ list. The OR is
satisfied by now taking the union of the two
sets of documents obtained for the parts. The
result is the set $\{D_1, D_2, D_3\}$ which
satisfies the query and each document in it is
'true' for the query.
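
The worked example can be reproduced with a few lines of Python. This is a
minimal sketch in which the inverted file is assumed to be a dictionary of
sets held in memory; a real system would keep the lists on disk and merge
them, but the set operations are the same.

```python
# Inverted file: one list (here a set) of document numbers per keyword.
inverted_file = {
    "K1": {"D1", "D2", "D3", "D4"},
    "K2": {"D1", "D2"},
    "K3": {"D1", "D2", "D3"},
    "K4": {"D1"},
}

# Q = (K1 AND K2) OR (K3 AND (NOT K4))
k1_and_k2 = inverted_file["K1"] & inverted_file["K2"]      # intersection
k3_not_k4 = inverted_file["K3"] - inverted_file["K4"]      # subtraction
result = k1_and_k2 | k3_not_k4                             # union

print(sorted(result))
```

Note that NOT is implemented as a set difference against the $K_3$ list, so
the complement of the whole collection never has to be formed.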

A slight modification of the full Boolean
search is one which only allows AND logic
but takes account of the actual number of
terms the query has in common with a
document. This number has become known
as the co-ordination level. The search
strategy is often called simple matching.
Because at any level we can have more than
one document, the documents are said to be
partially ranked by the co-ordination levels.

For the same example as before with the
query $Q = K_1\ \mathrm{AND}\ K_2\ \mathrm{AND}\ K_3$
we obtain the following ranking:

Co-ordination level

3    $D_1, D_2$
2    $D_3$
1    $D_4$


In fact, simple matching may be viewed as
using a primitive matching function. For
each document $D$ we calculate $|D \cap Q|$,
that is the size of the overlap between $D$
and $Q$, each represented as a set of keywords.
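
Simple matching can be sketched in Python as follows. The document
descriptions are derived from the keyword lists of the Boolean example (for
instance $D_3$ appears in the $K_1$ and $K_3$ lists only); the function name
and the dictionary layout are assumptions made for the illustration.

```python
from collections import defaultdict

# Each document represented as a set of keywords.
documents = {
    "D1": {"K1", "K2", "K3", "K4"},
    "D2": {"K1", "K2", "K3"},
    "D3": {"K1", "K3"},
    "D4": {"K1"},
}


def coordination_levels(documents, query):
    """Group documents by |D n Q|, highest co-ordination level first."""
    levels = defaultdict(list)
    for doc_id, keywords in documents.items():
        overlap = len(keywords & query)
        if overlap:
            levels[overlap].append(doc_id)
    return {level: sorted(levels[level]) for level in sorted(levels, reverse=True)}


ranking = coordination_levels(documents, {"K1", "K2", "K3"})
for level, docs in ranking.items():
    print(level, docs)
```

Documents that share a level are not ordered among themselves, which is why
the ranking is only partial.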

Another modification to simple Boolean
matching derives from Fuzzy Logic. This
work was pioneered by Zadeh (1965) and
convincingly characterised by Bellman and
Giertz (1973). In information retrieval
several authors have worked on applying the
techniques, e.g. Bookstein (1980), Radecki
(1979). There are fuzzy versions of the
conventional set operations. If a document is
in set A to degree a, and in set B to degree b,
where a is greater than b, then it is in:

- the union of A and B to the degree a

- the intersection of A and B to the degree b

- the complement of A to degree 1-a, and

- the complement of B to degree 1-b.
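
The rules above amount to taking the maximum for union, the minimum for
intersection and one minus the degree for the complement. A minimal Python
sketch, with hypothetical function names and membership degrees assumed to
lie between 0 and 1:

```python
def fuzzy_union(a: float, b: float) -> float:
    """Degree of membership in the union of A and B."""
    return max(a, b)


def fuzzy_intersection(a: float, b: float) -> float:
    """Degree of membership in the intersection of A and B."""
    return min(a, b)


def fuzzy_complement(a: float) -> float:
    """Degree of membership in the complement of A."""
    return 1.0 - a


a, b = 0.75, 0.25   # document is in A to degree a and in B to degree b
print(fuzzy_union(a, b), fuzzy_intersection(a, b),
      fuzzy_complement(a), fuzzy_complement(b))
```

With degrees restricted to 0 and 1 the three functions reduce to the ordinary
OR, AND and NOT of the Boolean search.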


Matching functions

Many of the more sophisticated search
strategies are implemented by means of a
matching function. This is a function similar
to an association measure, but differing in
that a matching function measures the
association between a query and a document
or cluster profile, whereas an association
measure is applied to objects of the same
kind. Mathematically the two functions have
the same properties; they only differ in their
interpretations.

There are many examples of matching
functions in the literature. Perhaps the
simplest is the one associated with the simple
matching search strategy.

If M is the matching function, D the set of
keywords representing the document, and Q
the set representing the query, then:

$$M = \frac{2\,|D \cap Q|}{|D| + |Q|}$$

is another example of a matching function.
It is, of course, the same as Dice's coefficient,
a well known coefficient from the
numerical taxonomy literature (Sneath and
Sokal, 1973).


A popular one used by the SMART project
(Salton, 1971), which they call cosine
correlation, assumes that the document and
query are represented as numerical vectors in
t-space, that is $Q = (q_1, q_2, \ldots, q_t)$ and
$D = (d_1, d_2, \ldots, d_t)$ where $q_i$ and $d_i$ are
numerical weights associated with the
keyword $i$. The cosine correlation is now
simply

$$r = \frac{\sum_{i=1}^{t} q_i d_i}{\left(\sum_{i=1}^{t} q_i^2\right)^{1/2}\left(\sum_{i=1}^{t} d_i^2\right)^{1/2}}$$

or, in the notation for a vector space with a
Euclidean norm,

$$r = \frac{(Q, D)}{\|Q\|\,\|D\|} = \cos\theta$$

where $\theta$ is the angle between vectors Q and
D. The norms do not need to be the
Euclidean norm; Salton (1989) has
investigated a range of different norms.
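
The two matching functions of this section, Dice's coefficient on keyword
sets and the cosine correlation on weight vectors, can be sketched in Python
as follows. The function names are hypothetical, and returning 0 for empty
sets or zero vectors is a convention chosen for this sketch, not something
stated in the text.

```python
import math


def dice(document: set, query: set) -> float:
    """M = 2|D n Q| / (|D| + |Q|) for keyword sets D and Q."""
    if not document and not query:
        return 0.0
    return 2 * len(document & query) / (len(document) + len(query))


def cosine(q: list, d: list) -> float:
    """r = (Q, D) / (||Q|| ||D||) for weight vectors of equal length t."""
    inner = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return inner / (norm_q * norm_d)


print(dice({"K1", "K3"}, {"K1", "K2", "K3"}))
print(cosine([1, 1, 0], [1, 0, 1]))
```

Both functions range from 0 (nothing in common) to 1 (identical descriptions
or parallel vectors), which makes their scores comparable across documents.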
Serial search

Although serial searches are acknowledged to
be slow, they are frequently still used as parts
of larger systems. They also provide a
convenient demonstration of the use of
matching functions. More importantly, they
allow the specification of query and
document representation without paying
much attention to efficiency considerations.
This means that once a matching function has
been defined, consideration can be given to
speeding up its execution using parallel
architectures without sacrificing any of the
complexity and accuracy of the representation
mechanisms. There are now a number of
parallel architectures (Rasmussen, 1991)
used to increase the efficiency of retrieval
engines.

Suppose there are N documents $D_i$ in the
system, then the serial search proceeds by
calculating N values $M(Q, D_i)$ from which
the set of documents to be retrieved is
determined. There are two ways of doing
this:

(1) the matching function is given a
    suitable threshold, retrieving the
    documents above the threshold and
    discarding the ones below. If T is
    the threshold, then the retrieved set
    B is the set $\{D_i \mid M(Q, D_i) > T\}$.

(2) the documents are ranked in
    increasing order of matching
    function value. A rank position R is
    chosen as cut-off and all documents
    below the rank are retrieved so that
    $B = \{D_i \mid r(i) < R\}$ where $r(i)$ is the
    rank position assigned to $D_i$. The
    hope in each case is that the relevant
    documents are contained in the
    retrieved set.

The main difficulty with this kind of search
strategy is the specification of the threshold
or cut-off. It will always be arbitrary since
there is no way of telling in advance what
value for each query will produce the best
retrieval.
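
Both ways of selecting the retrieved set can be sketched in Python. The
sketch uses simple matching as the matching function and the four documents
of the earlier example; it assumes that rank position 1 goes to the
best-scoring document and that ties are broken by document number, neither of
which is specified in the text.

```python
def simple_match(document: set, query: set) -> int:
    """Primitive matching function: size of the overlap |D n Q|."""
    return len(document & query)


def serial_search(documents: dict, query: set, match=simple_match) -> dict:
    """Compute M(Q, D_i) for every one of the N documents."""
    return {doc_id: match(keywords, query) for doc_id, keywords in documents.items()}


def retrieve_above_threshold(scores: dict, threshold: float) -> set:
    """B = {D_i | M(Q, D_i) > T}."""
    return {doc_id for doc_id, score in scores.items() if score > threshold}


def retrieve_below_rank(scores: dict, cutoff_rank: int) -> list:
    """B = {D_i | r(i) < R}, with r(i) = 1 for the best match (assumption)."""
    ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ranked[: max(cutoff_rank - 1, 0)]


documents = {
    "D1": {"K1", "K2", "K3", "K4"},
    "D2": {"K1", "K2", "K3"},
    "D3": {"K1", "K3"},
    "D4": {"K1"},
}
scores = serial_search(documents, {"K1", "K2", "K3"})
print(retrieve_above_threshold(scores, 1))
print(retrieve_below_rank(scores, 3))
```

Because the matching function is passed in as a parameter, it can be replaced
by a more complex one without touching the selection step.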

The advantage of a serial search is of course
that the matching function can be made as
complex as one wishes without any concern
for inversion to speed up access. More
recent research using parallel fine-grained
SIMD architectures (Stanfill and Kahle,
1986) uses what amounts to a serial search
except that it is implemented in parallel. It
allows the use of signatures derived from
superimposed coding for the representation
of documents. These signatures are matched
in their entirety against a query. The best
scoring documents are retrieved.


Probabilistic  Retrieval 

The basic instrument we have for trying to
separate the relevant from the non-relevant
documents is a matching function. The
reasons for picking any particular matching
function have never been made explicit. In
fact, mostly they are based on intuitive
argument in conjunction with Ockham's
Razor. Simple probability theory can tell us
what a matching function should look like
and how it should be used. The arguments
are mainly theoretical but, in my view, fairly
conclusive (van Rijsbergen, 1979). The
only remaining doubt is about the
acceptability of the assumptions, which I
shall briefly discuss. The data used to fix
such a matching function are derived from the
knowledge of the distribution of the index
terms throughout the collection of documents
or some subset of it. If it is defined on some
subset of documents then this subset can be
defined by a variety of techniques: sampling,
clustering, or trial retrieval. The data thus
gathered are used to set the values of certain
parameters associated with the matching
function. Clearly, should the data contain
relevance information, then the process of
defining the matching function can be iterated
by some feedback mechanism similar to the
one due to Rocchio described later in this
paper. In this way, the parameters of the
matching function can be 'learnt'. It is on
matching functions derived from relevance
information that we shall concentrate.

It will be assumed in the sequel that the
documents are described by binary state
attributes, that is, absence or presence of
index terms. This is not a restriction on the
theory; in principle the extension to arbitrary
attributes can be worked out, although it is
not clear that this would be worth doing
(Osborne, 1975).

When  we  search  a  document  collection,  we 
attempt  to  retrieve  relevant  documents 
without  retrieving  non-rdevant  ones.  Since 
we  have  no  oracle  which  will  tell  us  without 
fail  which  documents  are  relevuit  and  which 
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are  non-ieievanc,  we  must  use  imperfect 
knowledge  to  guess  for  any  given  document 
whether  it  is  relevant  or  non-relevant. 
Without  going  into  the  philosophical 
para^xes  associated  with  relevance,  I  shall 
assume  that  we  can  only  guess  at  relevance 
through  summary  data  about  the  document 
and  its  relationships  with  other  documents. 
This  is  not  an  unreasonable  assumption, 
particularly  if  one  believes  that  the  only  way 
relevance  can  ultimately  be  decided  is  for  the 
user  to  read  the  full  text.  Therefore,  a 
sensible  way  of  computing  our  guess  is  to  ^ 
and  estimate  for  any  document  its  probabili^ 
of  relevance 

pQ  (lelevance/document) 

where the Q is meant to emphasise that it is
for a specific query. It is not clear at all what
kind of probability this is (see Good, 1950,
for a delightful summary of different kinds),
but if we are to make sense of it with a
computer and the primitive data we have, it
must surely be one based on frequency
counts. Thus our probability of relevance is
a statistical notion rather than a semantic one,
but I believe that the degree of relevance
computed on the basis of statistical analysis
will tend to be very similar to one arrived at
on semantic grounds. Just as a matching
function attaches a numerical score to each
document and will vary from document to
document, so will the probability; for some it
will be greater than for others and, of course,
it will depend on the query. The variation
between queries will be ignored for now; it
only becomes important at the evaluation
stage. So we will assume only one query
has  been  submitted  to  the  system  and  we  are 
concerned  with 

$$P(\text{relevance} \mid \text{document}).$$

Let  us  now  assume  (following  Robertson, 
1977)  that: 

(1) The relevance of a document to a
request is independent of other
documents in the collection.

With this assumption we can now state a
principle, in terms of probability of
relevance, which shows that probabilistic
information can be used in an optimal manner
in retrieval. Robertson attributes this
principle to W. S. Cooper, although Maron in
1964 already claimed its optimality (Maron,
1965).

The probability ranking principle. If a
reference retrieval system's response
to each request is a ranking of the
documents in the collection in order of
decreasing probability of relevance to
the user who submitted the request,
where the probabilities are estimated as
accurately  as  possible  on  the  basis  of 
whatever  data  have  been  made 
available  to  the  system  for  this 
purpose,  the  overall  effectiveness  of 
the  system  to  its  user  will  be  the  best 
that  is  obtainable  on  the  basis  of  those 
data. 

Of  course,  this  principle  raises  many 
questions  as  to  the  acceptability  of  the 
assumptions.  For  example,  the  Cluster 
Hypothesis,  that  closely  associated 
documents  tend  to  be  relevant  to  the  same 
requests,  explicitly  assumes  the  contrary  of 
assumption (1). Goffman (1964) too, in his
work, has gone to some pains to make an
explicit assumption of dependence. I quote:
'Thus, if a document x has been assessed as
relevant to a query s, the relevance of the
other documents in the file X may be affected
since the value of the information conveyed
by these documents may either increase or
decrease as a result of the information
conveyed by the document x.' Then there is
the question of the way in which overall
effectiveness is to be measured. Robertson
in his paper shows the probability ranking
principle to hold if we measure effectiveness
in terms of Recall and Fallout.

But  this  is  not  the  place  to  argue  out  these 
research  questions.  However,  I  do  think  it 
reasonable to adopt the principle as one upon
which to construct a probabilistic retrieval
model. One word of warning: the
probability  ranking  principle  can  only  be 
shown  to  be  true  for  one  query.  It  does  not 
say  that  the  performance  over  a  range  of 
queries  will  be  optimised;  to  establish  a 
result  of  this  kind  one  would  have  to  be 
specific  about  how  one  would  average  the 
performance  across  queries. 

A detailed description of the probabilistic
model can be found in Chapter 6 of van
Rijsbergen (1979). There are several other
good summaries, e.g. Harper (1980), Salton
and McGill (1983).


Cluster  representatives 


Before  we  can  sensibly  talk  about  search 
strategies  applied  to  clustered  document 
collections, we need to say a little about the
methods used to represent clusters. Whereas
in a serial search we need to be able to match
queries with each document in the file, in a
search of a clustered file we need to be able to
match  queries  with  clusters.  For  this 
purpose,  clusters  are  represented  by  some 
kind  of  profile  (a  much  overworked  word), 
which  here  will  be  called  a  cluster 
representative.  It  attempts  to  summarise  and 
characterise  the  cluster  of  documents. 

A  cluster  representative  should  be  such  that 
an  incoming  query  will  be  diagnosed  into  the 
cluster  containing  the  documents  relevant  to 
the  query.  In  other  words,  we  expect  the 
cluster  representative  to  discriminate  the 
relevant  from  the  non-relevant  documents 
when  matched  against  any  query.  This  is  a 
tall  order  and,  unfortunately,  there  is  no 
theory enabling one to select the right kind of
cluster representative (but see Croft, 1979).
One  can  only  proceed  experimentally.  There 
are  a  number  of  'reasonable'  ways  of 
characterising  clusters;  it  then  remains  a 
matter  for  experimental  test  to  decide  which 
of  these  is  the  most  effective. 




Let me first give an example of a very
primitive cluster representative. If we
assume that the clusters are derived from a
cluster method based on a dissimilarity
measure (see van Rijsbergen 1979), then we
can represent each cluster at some level of
dissimilarity by a graph (see Figure 1). Here
A and B are two clusters. The nodes
represent documents and the line between any
two nodes indicates that their corresponding
documents are less dissimilar than some
specified level of dissimilarity. Now, one
way of representing a cluster is to select a
typical member from the cluster. A simple
way of doing this is to find that document
which is linked to the maximum number of
other documents in the cluster. A suitable
name for this kind of cluster representative is
the maximally linked document. In the
clusters A and B illustrated, there are
pointers to the candidates. As one would
expect in some cases, the representative is not
unique. For example, in cluster B we have
two candidates. To deal with this, one either
makes an arbitrary choice or one maintains a
list of cluster representatives for that cluster.
The motivation leading to this particular
choice of cluster representative is given in
some  detail  in  van  Rijsbergen  (1974a)  but 
need  not  concern  us  here. 
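
The selection of a maximally linked document can be sketched in a few lines of Python. This is only an illustration: the function name, the edge-list representation of the graph and the sample cluster are assumptions made for the example and are not taken from the work cited above. All tied candidates are returned, so that the caller may either pick one arbitrarily or keep the whole list.

```python
def maximally_linked(edges, members):
    """Return every document in `members` that is linked to the greatest
    number of other documents in the same cluster.

    `edges` is assumed to list each link once, as a pair of document ids;
    `members` is assumed to be a non-empty set of document ids.
    """
    degree = {doc: 0 for doc in members}
    for a, b in edges:
        # Only links between two distinct members of this cluster count.
        if a in degree and b in degree and a != b:
            degree[a] += 1
            degree[b] += 1
    best = max(degree.values())
    return sorted(doc for doc, d in degree.items() if d == best)


# Hypothetical cluster: a link means "less dissimilar than the chosen level".
cluster = {"d1", "d2", "d3", "d4"}
links = [("d1", "d2"), ("d1", "d3"), ("d1", "d4"), ("d2", "d3")]
candidates = maximally_linked(links, cluster)

# A chain of four documents has two equally well linked candidates.
chain = maximally_linked([("a", "b"), ("b", "c"), ("c", "d")], {"a", "b", "c", "d"})
```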




Figure 1. Examples of maximally linked documents as cluster representatives


Let us now look at other ways of
representing clusters. We seek a method of
representation which in some way 'averages'
the descriptions of the members of the
clusters. The method that immediately
springs to mind is one in which one
calculates the centroid (or centre of gravity)
of the cluster. If $\{D_1, D_2, \ldots, D_n\}$ are the
documents in the cluster and each $D_i$ is
represented by a numerical vector $(d_1, d_2, \ldots, d_t)$
then the centroid $C$ of the cluster is
given by

$$C = \frac{1}{n} \sum_{i=1}^{n} \frac{D_i}{\|D_i\|}$$

where $\|D_i\|$ is usually the Euclidean norm,
i.e.

$$\|D_i\| = \sqrt{d_1^2 + d_2^2 + \cdots + d_t^2}$$
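
As a minimal sketch, assuming documents are held as plain Python lists of term weights, the centroid of length-normalised document vectors might be computed as below. The function name and the treatment of an all-zero vector are assumptions of the example.

```python
import math


def centroid(docs):
    """Average of the documents after each is divided by its Euclidean norm."""
    n = len(docs)
    t = len(docs[0])
    c = [0.0] * t
    for d in docs:
        norm = math.sqrt(sum(x * x for x in d))
        if norm == 0.0:
            continue  # assumption: an all-zero vector contributes nothing
        for j, x in enumerate(d):
            c[j] += x / norm
    return [x / n for x in c]


# Two hypothetical documents over t = 2 index terms.
C = centroid([[3.0, 4.0], [0.0, 2.0]])
```

Here the normalised documents are (0.6, 0.8) and (0, 1), so the centroid is their average, (0.3, 0.9).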

More often than not the documents are not
represented by numerical vectors but by
binary vectors (or equivalently, sets of
keywords). In that case, we can still use a
centroid type of cluster representative but the
normalisation is replaced with a process
which thresholds the components of the sum
$\sum D_i$. To be more precise, let $D_i$ now be a
binary vector, such that a 1 in the $j$th position
indicates the presence of the $j$th keyword in
the document and a 0 indicates the contrary.
The cluster representative is now derived
from the sum vector

$$S = \sum_{i=1}^{n} D_i$$

(remember $n$ is the number of documents in
the cluster) by the following procedure.

Let $C = (c_1, c_2, \ldots, c_t)$ be the cluster
representative and $[D_i]_j$ the $j$th component of
the binary vector $D_i$, then two methods are:


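(1)
$$c_j = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} [D_i]_j > 1 \\ 0 & \text{otherwise} \end{cases}$$

or

(2)
$$c_j = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} [D_i]_j > \log_2 n \\ 0 & \text{otherwise} \end{cases}$$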


So, finally we obtain as a cluster
representative a binary vector C. In both cases
the intuition is that keywords occurring only
once in the cluster should be ignored. In the
second case we also normalise out the size n
of the cluster.
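
The two thresholding methods can be illustrated with a short Python sketch. The function names and the four sample documents are assumptions of the example; the second function takes the threshold to be the base-2 logarithm of the cluster size, which is how method (2) is read here.

```python
import math


def sum_vector(docs):
    """Component-wise sum of the binary document vectors."""
    return [sum(column) for column in zip(*docs)]


def rep_more_than_once(docs):
    """Method (1): keep a keyword if it occurs in more than one document."""
    return [1 if s > 1 else 0 for s in sum_vector(docs)]


def rep_log_threshold(docs):
    """Method (2): keep a keyword if it occurs in more than log2(n) documents."""
    threshold = math.log2(len(docs))
    return [1 if s > threshold else 0 for s in sum_vector(docs)]


# Hypothetical cluster of n = 4 documents over t = 4 keywords.
docs = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
]
rep1 = rep_more_than_once(docs)   # sum vector is [4, 3, 2, 1]
rep2 = rep_log_threshold(docs)    # threshold is log2(4) = 2
```

With these data the third keyword, which occurs in two of the four documents, survives the first method but not the second.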

There  is  some  evidence  to  show  that  both 
these  methods  of  representation  are  effective 
when  used  in  conjunction  with  appropriate 
search  strategies  (see,  for  example,  van 
Rijsbergen,  1974b,  and  Murray,  1972). 
Obviously  there  are  further  variations  on 
obtaining  cluster  representatives  but,  as  in  the 
case  of  association  measures,  it  seems 
unlikely  that  retrieval  effectiveness  will 
change  very  much  by  varying  the  cluster 
representatives.  It  is  more  likely  that  the 
way  the  data  in  the  cluster  representative  is 
used  by  the  search  strategy  will  have  a  larger 
effect. 

There  is  another  theoretical  way  of  looking  at 
the  construction  of  cluster  representatives  and 
that  is  through  the  notion  of  a  maximal 
predictor  for  a  cluster  (Gower,  1974).  Given 
that, as before, the documents $D_i$ in a cluster
are binary vectors then a binary cluster
representative for this cluster is a predictor in
the sense that each component ($c_j$) predicts
the most likely value of that attribute in
the member documents. It is maximal if its
correct predictions are as numerous as
possible. If one assumes that each member
of a cluster of documents $D_1, \ldots, D_n$ is
equally likely, then the expected total number
of incorrectly predicted properties (or simply
the expected total number of mismatches
between cluster representative and member
documents since everything is binary) is


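$$\sum_{i=1}^{n} \sum_{j=1}^{t} \left([D_i]_j - c_j\right)^2$$

This can be rewritten as

$$\sum_{i=1}^{n} \sum_{j=1}^{t} \left([D_i]_j - D_{\cdot j}\right)^2 + n \sum_{j=1}^{t} \left(D_{\cdot j} - c_j\right)^2 \qquad (*)$$

where

$$D_{\cdot j} = \frac{1}{n} \sum_{i=1}^{n} [D_i]_j$$

The expression (*) will be minimised, thus
maximising the number of correct
predictions, when $C = (c_1, \ldots, c_t)$ is
chosen in such a way that

$$\sum_{j=1}^{t} \left(D_{\cdot j} - c_j\right)^2$$

is a minimum. This is achieved by

(3)
$$c_j = \begin{cases} 1 & \text{if } D_{\cdot j} > \tfrac{1}{2} \\ 0 & \text{otherwise} \end{cases}$$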

So,  in  other  words,  a  keyword  will  be 
assigned  to  a  cluster  representative  if  it  occurs 
in  more  than  half  the  member  documents. 
This  treats  errors  of  prediction  caused  by 
absence  or  presence  of  keywords  on  an  equal 
basis.  Croft  (1979)  has  shown  that  it  is 
more  reasonable  to  differentiate  the  two 
types  of  error  in  IR  applications.  He 
showed that to predict falsely a 0 ($c_j = 0$) is
more costly than to predict falsely a 1 ($c_j = 1$).
Under this assumption the value of 1/2
appearing in (3) is replaced by a constant less
than 1/2, its exact value being related to the
relative  importance  attached  to  the  two  types 
of  prediction  error. 
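
A small Python sketch may make the maximal predictor concrete. The function names and the sample cluster are assumptions of the example, and the lower threshold of 0.25 is purely hypothetical: the text above says only that the constant should be less than 1/2.

```python
def maximal_predictor(docs, threshold=0.5):
    """Assign keyword j when the proportion of member documents
    containing it exceeds `threshold` (1/2 gives the maximal predictor)."""
    n = len(docs)
    return [1 if sum(column) / n > threshold else 0 for column in zip(*docs)]


def mismatches(docs, rep):
    """Total number of mismatches between the representative and the members."""
    return sum((d_j - c_j) ** 2 for d in docs for d_j, c_j in zip(d, rep))


# Hypothetical cluster: keyword proportions are 1.0, 0.75, 0.5 and 0.25.
docs = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
]
rep_half = maximal_predictor(docs)                     # threshold 1/2
rep_lower = maximal_predictor(docs, threshold=0.25)    # hypothetical lower constant
errors = mismatches(docs, rep_half)
```

Lowering the threshold admits the third keyword, which occurs in exactly half of the documents, trading a possible false 1 for the avoidance of a false 0.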

Although  the  main  reason  for  constructing 
these  cluster  representatives  is  to  lead  a 
search  strategy  to  relevant  documents,  it 
should  be  clear  that  they  can  also  be  used  to 
guide  a  search  to  documents  meeting  some 
condition  on  the  matching  function.  For 
example, we may want to retrieve all
documents $D_i$ which match $Q$ better than $T$,
i.e.

$$\{D_i \mid M(Q, D_i) > T\}.$$

For more details about the evaluation of
cluster representative (3) for this purpose the
reader should consult the work of Yu and
Luk  (1977). 

One  major  objection  to  most  work  on  cluster 
representatives  is  that  it  treats  the  distribution 
of  keywords  in  clusters  as  independent. 
This  is  not  very  realistic.  Unfortunately, 
there  does  not  appear  to  be  any  work  to 
remedy  the  situation  except  that  of 
Ardnaudov  and  Govorun  (1977),  and 
perhaps  that  of  El-hamdouchi  (1987). 


Finally,  it  should  be  noted  that  cluster 
methods  which  proceed  directly  from 
document  descriptions  to  the  classification 
without  first  computing  the  intermediate 
dissimilarity  coefficient,  will  need  to  make  a 
choice  of  cluster  representative  ab  initio. 
These  cluster  representatives  are  then 
'improved'  as  the  algorithm,  adjusting  the 
classification  according  to  some  objective 
function,  steps  through  its  iterations. 


Cluster-based  retrieval 

Cluster-based  retrieval  has  as  its  foundation 
the  cluster  hypothesis,  which  states  that 
closely  associated  documents  tend  to  be 
relevant  to  the  same  requests  (van  Rijsbergen 
and  Sparck  Jones,  1973).  Clustering  picks 
out  closely  associated  documents  and  groups 
them  together  into  one  cluster.  In  van 
Rijsbergen  (1979),  Chapter  3,  I  discussed 
many  ways  of  doing  this;  here  I  shall  ignore 
the  actual  mechanism  of  generating  the 
classification and concentrate on how it may
be  searched  with  the  aim  of  retrieving 
relevant  documents. 

Suppose  we  have  a  hierarchic  classification 
of  documents  then  a  simple  search  strategy 
goes  as  follows  (refer  to  Figure  2  for  details). 
The  search  starts  at  the  root  of  the  tree,  node 
0  in  the  example.  It  proceeds  by  evaluating  a 
matching  function  at  the  nodes  immediately 
descendant  from  node  0,  in  the  example  the 
nodes  1  and  2.  This  pattern  repeats  itself 
down  the  tree.  The  search  is  directed  by  a 
decision  rule  which,  on  the  basis  of 
comparing  the  values  of  a  matching  function 
at  each  stage,  decides  which  node  to  expand 
further.  Also,  it  is  necessary  to  have  a 
stopping rule which terminates the search and
forces  a  retrieval.  In  Figure  2  the  decision 
rule  is:  expand  the  node  corresponding  to  the 
maximum  value  of  the  matching  function 
achieved  within  a  filial  set.  The  stopping 
rule  is:  stop  if  the  current  maximum  is  less 
than  the  previous  maximum.  A  few  remarks 
about  this  strategy  are  in  order: 

(1) we assume that effective retrieval can
be achieved by finding just one
cluster;

(2) we assume that each cluster can be
adequately represented by a cluster
representative for the purpose of
locating the cluster containing the
relevant documents;

(3) if the maximum of the matching
function is not unique some special
action, such as a look-ahead, will
need to be taken;

(4) the search always terminates and will
retrieve at least one document.

Figure 2. A search tree and the appropriate values of a matching function illustrating the action
of a decision rule and a stopping rule.
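
The simple top-down strategy can be sketched as follows. This is an illustration under stated assumptions: the tree, the node numbers and the matching values are invented for the example (they are not those of Figure 2), the matching values are assumed to be precomputed for one query, and ties are resolved by taking the first maximum rather than by a look-ahead.

```python
def top_down_search(children, score, root=0):
    """Descend from the root, expanding the best-matching child each time.

    Decision rule: expand the child with the maximum matching value.
    Stopping rule: stop if the current maximum is less than the previous one.
    Returns the node whose cluster is to be retrieved.
    """
    current = root
    previous_max = float("-inf")
    while children.get(current):
        best = max(children[current], key=lambda node: score[node])
        if score[best] < previous_max:
            break  # stopping rule: retrieve the cluster reached so far
        previous_max = score[best]
        current = best
    return current


def documents_in(children, node):
    """All terminal nodes (documents) below `node`, or `node` itself if terminal."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    found = []
    for kid in kids:
        found.extend(documents_in(children, kid))
    return found


# Hypothetical tree and matching values for a single query.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6], 5: [7, 8]}
score = {1: 0.4, 2: 0.6, 3: 0.5, 4: 0.1, 5: 0.7, 6: 0.3, 7: 0.5, 8: 0.2}
chosen = top_down_search(children, score)
retrieved = documents_in(children, chosen)
```

The search moves from node 0 to node 2 and then to node 5; below node 5 the best value falls from 0.7 to 0.5, so the search stops and the cluster at node 5 is retrieved.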

An  immediate  generalisation  of  this  search  is 
to  allow  the  search  to  proceed  down  more 
than  one  branch  of  the  tree  so  as  to  allow 
retrieval  of  more  than  one  cluster.  By 
necessity  the  decision  rule  and  stopping  rule 
will be slightly more complicated. The main
difference is that provision must be made
for back-tracking. This will occur when the
search  strategy  estimates  (based  on  the 
current  value  of  the  matching  function)  that 
further  progress  down  a  branch  is  a  waste  of 
time,  at  which  point  it  may  or  may  not 
retrieve  the  current  cluster.  The  search  then 
returns  (back-tracks)  to  a  previous  branching 
point  and  takes  an  alternative  branch  down 
the  tree. 

The above strategies may be described as
top-down searches. A bottom-up search is one
which  enters  the  tree  at  one  of  its  terminal 
nodes,  and  proceeds  in  an  upward  direction 
toward  the  root  of  the  tree.  In  this  way  it 
will  pass  through  a  sequence  of  nested 


clusters  of  increasing  size.  A  decision  rule  is 
not  required;  we  only  need  a  stopping  rule 
which  could  be  simply  a  cut-off.  A  typical 
search  would  seek  the  largest  cluster 
containing  the  document  represented  by  the 
starting  node  and  not  exceeding  the  cut-off  in 
size.  Once  this  cluster  is  found,  the  set  of 
documents  in  it  is  retrieved.  To  initiate  the 
search  in  response  to  a  request,  it  is 
necessary  to  know  in  advance  one  terminal 
node appropriate for that request. It is not
unusual to find that a user will already know
of  a  document  relevant  to  his  request  and  is 
seeking  other  documents  similar  to  it.  This 
'source'  document  can  thus  be  used  to  initiate 
a  bottom-up  search.  For  a  systematic 
evaluation  of  bottom-up  searches  in  terms  of 
efficiency  and  effectiveness  see  Croft  (1979). 
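
A bottom-up search with a simple cut-off as its stopping rule might look like the sketch below. The parent table, the cluster sizes and the function name are assumptions of the example; the cut-off is taken to be a limit on the number of documents in the retrieved cluster.

```python
def bottom_up_search(parent, size, start, cutoff):
    """Climb from the terminal node `start` towards the root and return the
    largest enclosing cluster whose size does not exceed `cutoff`."""
    node = start
    while node in parent and size[parent[node]] <= cutoff:
        node = parent[node]
    return node


# Hypothetical hierarchy: node 0 is the root; nodes 3, 4, 6, 7 and 8 are documents.
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2, 7: 5, 8: 5}
size = {0: 5, 1: 2, 2: 3, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 8: 1}

# Starting from the 'source' document 7 with a cut-off of three documents.
cluster = bottom_up_search(parent, size, start=7, cutoff=3)
```

Starting from document 7, the nested clusters have sizes 2 (node 5), 3 (node 2) and 5 (the root), so a cut-off of 3 stops the climb at node 2.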

If  we  now  abandon  the  idea  of  having  a 
multi-level  clustering  and  accept  a  single-level 
clustering,  we  end  up  with  the  approach  to 
document clustering which Salton and his
co-workers have worked on extensively. The
search strategy is in part a serial search. It
proceeds by first finding the best (or nearest)
cluster(s)  and  then  looking  within  these. 
The  second  stage  is  achieved  by  doing  a 
serial  search  of  the  documents  in  the  selected 


cluster(s).  The  output  is  frequently  a 
ranking  of  the  documents  so  retrieved. 
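
The two-stage search of a single-level clustering can be sketched as follows, assuming the cosine correlation as the matching function for both stages. The data layout (a list of representative and document-table pairs), the function names and the sample vectors are assumptions of the example, not a description of any particular system.

```python
import math


def cosine(q, d):
    """Cosine correlation between two vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0


def cluster_search(query, clusters, n_clusters=1):
    """Stage 1: pick the `n_clusters` nearest clusters by their representatives.
    Stage 2: serially match every document in those clusters and rank them."""
    nearest = sorted(clusters, key=lambda c: cosine(query, c[0]), reverse=True)
    scored = [
        (cosine(query, vector), doc_id)
        for _, docs in nearest[:n_clusters]
        for doc_id, vector in docs.items()
    ]
    # Decreasing matching value; ties are broken by document id.
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical collection: two clusters, each a (representative, documents) pair.
clusters = [
    ([1, 1, 0], {"a": [1, 1, 0], "b": [1, 0, 0]}),
    ([0, 0, 1], {"c": [0, 1, 1], "d": [0, 0, 1]}),
]
ranking = cluster_search([1, 0, 0], clusters)
wider = cluster_search([1, 0, 0], clusters, n_clusters=2)
```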


Interactive  search  formulation 

A  user  confronted  with  an  automatic  retrieval 
system  is  unlikely  to  be  able  to  express  his 
information  need  in  one  go.  He  is  more 
likely to want to indulge in a trial-and-error
process  in  which  he  formulates  his  query  in 
the  light  of  what  the  system  can  tell  him  about 
his  query.  The  kind  of  information  that  he  is 
likely  to  want  to  use  for  the  reformulation  of 
his  query  is: 

(1)  the  frequency  of  occurrence  in  the 
data  base  of  his  search  terms; 

(2)  the  number  of  documents  likely  to  be 
retrieved  by  his  query; 

(3) alternative and related terms to the
ones used in his search;

(4) a small sample of the citations likely
to be retrieved; and

(5)  the  terms  used  to  index  the  citations 
in  (4). 

All  this  can  be  conveniently  provided  to  a 
user  during  his  search  session  by  an 
interactive  retrieval  system.  If  he  discovers 
that  one  of  his  search  terms  occurs  very 
frequently,  he  may  wish  to  make  it  more 
specific  by  consulting  a  hierarchic  dictionary 
which  will  tell  him  what  his  options  are. 
Similarly,  if  his  query  is  likely  to  retrieve  too 
many  documents,  he  can  make  it  more 
specific. 

The  sample  of  citations  and  their  indexing 
will  give  him  some  idea  of  what  kind  of 
documents  are  likely  to  be  retrieved  and  thus 
some idea of how effective his search terms
have  been  in  expressing  his  information 
need.  He  may  modify  his  query  in  the  light 
of  this  sample  retrieval.  This  process,  in 
which  the  user  modifies  his  query  based  on 
actual  search  results,  could  be  described  as  a 
form  of  feedback. 

We  now  look  at  a  mathematical  approach  to 
the  use  of  feedback  where  the  system 
automatically  modifies  the  query. 


Feedback 

The  word  feedback  is  normally  used  to 
describe the mechanism by which a system


can  improve  its  performance  on  a  task  by 
taking  account  of  past  performance.  In  other 
words,  a  simple  input-output  system  feeds 
back  the  information  from  the  output  so  that 
this  may  be  used  to  improve  the  performance 
on  the  next  input.  The  notion  of  feedback  is 
well  established  in  biological  and  automatic 
control  systems.  It  has  been  popularised  by 
Norbert Wiener in his book Cybernetics. In
information retrieval it has been used with
considerable effect.

Consider  now  a  retrieval  strategy  that  has 
been  implemented  by  means  of  a  matching 
function  M.  Furthermore,  let  us  suppose 
that  both  the  query  Q  and  document 
representatives D are t-dimensional vectors
with  real  components  where  t  is  the  number 
of  index  terms.  Because  it  is  my  purpose  to 
explain  feedback,  I  will  consider  its 
applications  to  a  serial  search  only. 

It  is  the  aim  of  every  retrieval  strategy  to 
retrieve the relevant documents $A$ and
withhold the non-relevant documents $\bar{A}$.
Unfortunately relevance is defined with
respect to the user's semantic interpretation of
his query. From the point of view of the
retrieval system, his formulation of it may not
be ideal. An ideal formulation would be one
which retrieved only the relevant documents.
In the case of a serial search the system will
retrieve all $D$ for which $M(Q, D) > T$ and not
retrieve any $D$ for which $M(Q, D) \le T$, where
$T$ is a specified threshold. It so happens that
in  the  case  where  M  is  the  cosine  correlation 
function,  i.e. 

$$M(Q, D) = \frac{(Q, D)}{\|Q\| \, \|D\|} = \frac{1}{\|Q\| \, \|D\|} \left(q_1 d_1 + q_2 d_2 + \cdots + q_t d_t\right),$$

the decision procedure

$$M(Q, D) - T > 0$$

corresponds to a linear discriminant function
used to linearly separate two sets $A$ and $\bar{A}$ in
$R^t$. Nilsson (1965) has discussed in great
detail how functions such as this may be
'trained' by modifying the weights $q_i$ to
discriminate correctly between two
categories. Let us suppose for the moment
that $A$ and $\bar{A}$ are known in advance, then the
correct query formulation $Q_0$ would be one
for which

$$M(Q_0, D) > T \text{ whenever } D \in A$$

and

$$M(Q_0, D) \le T \text{ whenever } D \in \bar{A}.$$


The  interesting  thing  is  that  starting  with  any 
Q  we  can  adjust  it  iteratively  using  feedback 
information so that it will converge to $Q_0$.
There is a theorem (Nilsson, 1965, page 81)
which states that, providing $Q_0$ exists, there
is an iterative procedure which will ensure
that $Q$ will converge to $Q_0$ in a finite number
of  steps. 

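The iterative procedure is called the
fixed-increment error correction procedure.

It goes as follows:

$$Q_i = Q_{i-1} + cD \quad \text{if } M(Q_{i-1}, D) - T \le 0 \text{ and } D \in A$$

$$Q_i = Q_{i-1} - cD \quad \text{if } M(Q_{i-1}, D) - T > 0 \text{ and } D \in \bar{A}$$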

and no change is made to $Q_{i-1}$ if it diagnoses
correctly. $c$ is the correction increment; its
value is arbitrary and is therefore usually set
to unity. In practice it may be necessary to
cycle through the set of documents several
times before the correct set of weights is
achieved, namely those which will separate $A$
and $\bar{A}$ linearly (this is always providing a
solution exists).
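
The correction cycle can be sketched in Python as below. Several things here are assumptions of the example rather than part of the procedure as stated: the matching function is taken to be the plain inner product (a linear function of the query weights), the threshold is set to zero, a cap on the number of passes is imposed in case no separating query exists, and the document vectors are invented.

```python
def inner(q, d):
    """Inner product, used here as a linear matching function (an assumption)."""
    return sum(a * b for a, b in zip(q, d))


def fixed_increment(query, relevant, non_relevant, threshold=0.0, c=1.0, max_passes=100):
    """Cycle through the documents, correcting the query on every error.

    Returns (query, converged); `converged` is False if `max_passes` was
    reached before a full pass without corrections.
    """
    q = list(query)
    for _ in range(max_passes):
        changed = False
        for d in relevant:
            if inner(q, d) - threshold <= 0:   # relevant document rejected
                q = [qj + c * dj for qj, dj in zip(q, d)]
                changed = True
        for d in non_relevant:
            if inner(q, d) - threshold > 0:    # non-relevant document retrieved
                q = [qj - c * dj for qj, dj in zip(q, d)]
                changed = True
        if not changed:
            return q, True
    return q, False


# Hypothetical sets A (relevant) and its complement over t = 3 index terms.
A = [[1, 1, 0], [1, 0, 0]]
A_bar = [[0, 0, 1], [0, 1, 1]]
Q0, converged = fixed_increment([0, 0, 1], A, A_bar)
```

Starting from a query that matches only the third term, one pass of corrections followed by one clean pass yields weights that separate the two sets.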

The  situation  in  actual  retrieval  is  not  as 
simple. We do not know the sets $A$ and $\bar{A}$
in advance; in fact $A$ is the set we hope to
retrieve. However, given a query
formulation $Q$ and the documents retrieved by
it,  we  can  ask  the  user  to  tell  the  system 
which  of  the  documents  retrieved  were 
relevant  and  which  were  not.  The  system 


can  then  automatically  modify  Q  so  that  at 
least  it  will  be  able  to  diagnose  correctly  those 
documents  that  the  user  has  seen.  The 
assumption  is  that  this  will  improve  retrieval 
on  the  next  run  by  virtue  of  the  fact  that  its 
performance  is  better  on  a  sample. 

Once again this is not the whole story. It is
often difficult to fix the threshold in advance
so that instead documents are ranked in
decreasing matching value on output. It is
now more difficult to define what is meant by
an  ideal  query  formulation.  Rocchio  (1966) 
in  his  thesis  defined  the  optimal  query  as 
one  which  maximised: 


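$$\Phi = \frac{1}{|A|} \sum_{D \in A} M(Q, D) - \frac{1}{|\bar{A}|} \sum_{D \in \bar{A}} M(Q, D)$$

If $M$ is taken to be the cosine function $(Q, D) / \|Q\| \, \|D\|$
then it is easy to show that $\Phi$ is
maximised by

$$Q_0 = c \left( \frac{1}{|A|} \sum_{D \in A} \frac{D}{\|D\|} - \frac{1}{|\bar{A}|} \sum_{D \in \bar{A}} \frac{D}{\|D\|} \right)$$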

where  c  is  an  arbitrary  proportionality 
constant. 

If the summations instead of being over $A$
and $\bar{A}$ are now made over $A \cap B_i$ and $\bar{A} \cap B_i$,
where $B_i$ is the set of retrieved documents
on the $i$th iteration, then we have a query
formulation which is optimal for $B_i$, a subset
of the document collection. By analogy to
the linear classifier used before, we now add
this vector to the query formulation on the $i$th
step to get:

$$Q_{i+1} = w_1 Q_i + w_2 \left[ \frac{1}{|A \cap B_i|} \sum_{D \in A \cap B_i} \frac{D}{\|D\|} - \frac{1}{|\bar{A} \cap B_i|} \sum_{D \in \bar{A} \cap B_i} \frac{D}{\|D\|} \right]$$
where $w_1$ and $w_2$ are weighting coefficients.
Salton (1968), in fact, used a slightly
modified version. The most important
difference is that there is an option to
generate $Q_{i+1}$ from $Q_i$, or $Q$, the original
query. The effect of all these adjustments
may be summarised by saying that the query
is automatically modified so that index terms
in relevant retrieved documents are given
more weight (promoted) and index terms in
non-relevant documents are given less weight
(demoted). 
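
One step of this kind of query modification can be sketched as follows. The function names, the default weights of 1.0 and the sample vectors are assumptions of the example, as is the convention that an empty set of judged documents contributes a zero vector; the sketch follows the update written in terms of the relevant and non-relevant retrieved documents, not Salton's modified version.

```python
import math


def unit(d):
    """The vector divided by its Euclidean norm (unchanged if all zero)."""
    norm = math.sqrt(sum(x * x for x in d))
    return [x / norm for x in d] if norm else list(d)


def mean_unit(docs, t):
    """Average of the length-normalised vectors; zero vector for an empty set."""
    if not docs:
        return [0.0] * t
    total = [0.0] * t
    for d in docs:
        for j, x in enumerate(unit(d)):
            total[j] += x
    return [x / len(docs) for x in total]


def feedback_step(query, rel_retrieved, nonrel_retrieved, w1=1.0, w2=1.0):
    """New query = w1 * old query + w2 * (mean relevant - mean non-relevant)."""
    t = len(query)
    promoted = mean_unit(rel_retrieved, t)
    demoted = mean_unit(nonrel_retrieved, t)
    return [w1 * q + w2 * (p - n) for q, p, n in zip(query, promoted, demoted)]


# Hypothetical judgements on the documents retrieved at the current iteration.
new_query = feedback_step([1.0, 0.0, 0.0], [[1.0, 1.0, 0.0]], [[0.0, 0.0, 2.0]])
```

In this example the second term, present only in the relevant document, acquires a positive weight, while the third term, present only in the non-relevant document, acquires a negative one.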

Experiments  have  shown  that  relevance 
feedback  can  be  very  effective.  It  is  now 
one  of  the  techniques  that  is  frequently 
implemented  in  new  operational  systems. 

Finally,  a  few  comments  about  the  technique 
of  relevance  feedback  in  general.  It  appears 
to  me  that  its  implementation  on  an 
operational  basis  may  be  more  problematic. 
It  is  not  clear  how  users  are  to  assess  the 
relevance,  or  non-relevance,  of  a  document 
from  such  scanty  evidence  as  citations.  In 
an  operational  system  it  is  easy  to  arrange  for 
abstracts  to  be  output  but  it  is  likely  that  a 
user  will  need  to  browse  through  the 
retrieved  documents  themselves  to  determine 
their  relevance  after  which  he  may  well  wish 
to  control  the  query  adjustment  himself  or,  at 
least, partially influence any automatic
adjustments made.
Abstract: This paper opens with a brief history of hypertext and hypermedia in the context
of information management during the 'information age.' Relevant terms are defined and the
approach of the paper is explained. Linear and hypermedia information access methods are
contrasted. A discussion of hyperprogramming in the handling of complex scientific and
technical information follows. A selection of innovative hypermedia systems is discussed. An
analysis of the Clinical Practice Library of Medicine NASA STI Program hypermedia application
is presented. The paper concludes with a discussion of the NASA STI Program's future
hypermedia project plans.


Introduction

The information management and media production
environments have changed dramatically in the last ten
years. The development of new communications systems,
more powerful microcomputers, optical storage
technologies, imaging and scanning technologies, and
animation and interactive video technologies has
dramatically altered the structure of society so that we
now live in an information age. In today's society, the
ability to retrieve, manage, and use information is of
paramount importance. Information retrieval is concerned
with the representation, storage and retrieval of
documents or document surrogates [CROU89]. This
technological revolution is changing the way we think of
information retrieval and forcing an expanded definition of
familiar terms such as 'document'. At one time a document
meant a formal or legal paper such as a hardcopy technical
report. Now information pundits are adopting a broader
definition of document in saying that a document is
'recorded information structured for human
comprehension.' [LEVIst] This paper approaches
information retrieval with the broadest perspective
possible in including documents in both familiar and still
emerging forms.

This intensifying technological revolution has added
impetus to the acceptance of new methodologies for
information and knowledge retrieval and management. A
methodology to be addressed here is hypertext. Ted Nelson
coined the term hypertext in 1965 to describe a system of
computer-supported, nonsequential information
processing. Early hypertext systems developed on
mainframe computers made it possible for users to create
and explore information interactively. The central concept
was the ability to create computer-supported links or
cross-references permitting rapid, easy movement
between related information. The ability of the user to
control his path through the information and to annotate or
add to the information was also a key concept. Hypertext is
actually a subset of a larger technology called hypermedia
that is now available on computers ranging from
mainframes to micros. Hypermedia extends the hypertext
concept to link not only textual material, but all forms of
material that may be digitally encoded for storage and
retrieval through computer-based systems such as images,
sound, graphics and video [CHeW]. To acknowledge the
multiple forms of information in such systems, the term
multimedia has been applied to hypermedia systems. In the
literature today, one often sees the terms hypermedia and
multimedia used interchangeably. However, whereas a
hypermedia system can be correctly described as a


multimedia  system,  a  multimedia  system  is  not  necessarily 
a hypermedia system. The mere inclusion of multiple
forms of information linked to each other is not enough to
be a hypermedia system. Only when users can
interactively take control of a set of dynamic links among
units  of  information,  can  the  system  be  correctly  referred 
to  as  a  hypermedia  system. 

Linear  Access  vs.  Hypermedia  Trails 

Reading is fundamentally linear. Words are grouped
together to form sentences, sentences to form paragraphs,
paragraphs to form documents. Each has a beginning and is
read through to the end. If the reader is searching for a
specific piece of information contained in the document, the
document is likely to be skimmed or perused linearly until
the information is found. This is true of other linear
mediums such as videotapes as well. With large
information-oriented documents such as texts, the reader
may use an index to locate the unit containing the
particular reference. The unit is then read linearly until
the information is found. Even though the user can turn
directly to the location of the information, the act of
extraction is still linear. The document itself was
designed to be accessed with a clear path through the
information from beginning to end. Although the analogy
is strained, this linear information access method can be
viewed as corresponding to traditional information access
methods as applied to such text products when reproduced
in computer-usable formats [BELL87]. In contrast,
hypertext systems may provide the user with an initial
linear access method, but at any given location in the
information, the user has the option of selecting one to
many further references. In some systems, a view of both
the incoming and outgoing references is available. As
mentioned above, hypermedia systems allow these
references to consist of any recorded information
structured for human comprehension that can be accessed
via a computer. Thus, with such systems the end user can
pursue data references by following a self-selected trail or
combination of trails through the data [BELL87].
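
The contrast between linear access and a self-selected trail can be
made concrete with a small sketch. The following Python fragment is
an illustration only, not a description of any system cited here; the
class name, node identifiers, and media labels are all hypothetical.
It keeps both the outgoing and the incoming references for every node
and records the trail produced by a sequence of user choices.

```python
from collections import defaultdict


class Hyperdocument:
    """Toy hypermedia web: nodes hold content, links are directed references."""

    def __init__(self):
        self.nodes = {}                      # node id -> (media type, content)
        self.outgoing = defaultdict(list)    # node id -> ids it refers to
        self.incoming = defaultdict(list)    # node id -> ids that refer to it

    def add_node(self, node_id, media, content):
        self.nodes[node_id] = (media, content)

    def add_link(self, source, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both ends of a link must be existing nodes")
        self.outgoing[source].append(target)
        self.incoming[target].append(source)

    def follow(self, start, choices):
        """Return the trail visited when the user picks choices[i] at step i."""
        trail = [start]
        current = start
        for choice in choices:
            current = self.outgoing[current][choice]
            trail.append(current)
        return trail


# Hypothetical content: one text node that refers to three other nodes.
doc = Hyperdocument()
doc.add_node("intro", "text", "Opening section")
doc.add_node("figure", "image", "Diagram")
doc.add_node("sound", "audio", "Narration")
doc.add_node("detail", "text", "Supporting section")
doc.add_link("intro", "figure")
doc.add_link("intro", "sound")
doc.add_link("intro", "detail")
doc.add_link("figure", "detail")

trail = doc.follow("intro", [0, 0])   # intro -> figure -> detail
```

A linear document corresponds to the special case in which every node
has exactly one outgoing reference, so that `follow` can only ever
produce one trail.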

Hyperprogramming and STI

Hyperprogramming is the process of creating hypertext or
hypermedia applications. Although the root
word 'program' is included, professional programmers or
software engineers are NOT required to create hypermedia
applications. The newest hypermedia authoring systems
have been designed to put hypermedia authoring into the
hands of end users so that they can bring their ideas to life
without having to master computer programming to do so.

Hypermedia is often presented as a new medium (akin to
the  invention  of  paper)  that  has  tremendous  potential  to 
transform  society.  As  a  communications  medium, 
hypermedia  merges  the  three  separate  technologies  of 
motion  pictures,  publishing,  and  computing  with  expertise 
drawn  from  all  three  disciplines  to  develop  authoring 
systems  designed  to  be  accessible  to  any  computer  user. 

Just as other new media have required a technological
infrastructure before becoming widespread, hypermedia's
popularity has required the availability of powerful
desktop microcomputers with considerable storage
available to handle large webs of multimedia nodes and
links [WARR90]. Additionally, the recent emergence of
systems that automatically generate hypertext documents
from linear text documents has provided renewed interest
in hypertext and hypermedia from organizations that
originally rejected the technology as too labor-intensive.

Scientific and technical information (STI) differs from
conventional and non-scientific information in two
important ways. First, the syntactic and semantic
elements found in the text are subject-area specific and
consequently there is only a limited amount of overlap with
those pieces of text which are also subject-area specific
but whose subjects of discussion are different. Secondly,
STI is, in general, rich in pragmatic content. The
pragmatic aspects relevant to STI include items such as
references to work by other authors, figures, tabular data,
charts, mathematical expressions, and even videotapes of
tests, equipment launches, etc. (MARSStJ. Consider the
following typical excerpt of scientific and technical
information:

The science needs, as expressed at the High
Resolution High Frame Rate Video Workshop held May 11-
12, 1988 at NASA's Lewis Research Center (see Ref. 7),
were largely validated by our review and analysis of those
needs. Table 1 contains a summary of some of the greatest
imaging needs. To be able to serve this domain of imaging
and its attended data rates requires significant computer,
recording, power, space, and weight resources (see Ref.
8). [HAND90]

The typical user might be interested in looking at the video
tapes of the workshop, examining the contents of Table 1
and possibly manipulating them, and obtaining additional
details on the article cited in Reference 8. All of these
pieces of information could be accessed within a
hypermedia system by activating a link. The ease of access
stands in sharp contrast to juggling a cited article, a
hardcopy table of data, a formula and/or software to
manipulate the data, and a videorecorder to play the
videotapes. Additionally, the complexity added
by the data table and other pieces of information could be
concealed or revealed to facilitate navigation through
complex levels of abstraction.

The aerospace and defense community is primarily a
community of scientists and engineers who are notorious
for depending upon informal communication for
information transfer. Hypermedia can improve upon the
informal communication channels used for information
transfer. The collaboration that now takes place via
electronic mail and computer conferences can be enhanced
by hypertext or hypermedia mechanisms. Communications
networks that span national boundaries provide the
connectivity necessary to allow multi-national
participation in such collaborations, with information
nodes annotated by any participant at any time. As data
compression and networking technologies advance, we will
see hypermedia nodes and annotations in addition to text.
Asynchronous annotation of information nodes is especially
practical for collaborations across time zones where
participants are not all likely to be awake at the same time.
During the interim, computer-aided systems (like
SYSTRAN) can be used to address the language barriers that
exist. Thus, hypermedia technology can be used to enable
collaboration despite barriers of distance, time, or
language.

Within the aerospace and defense community, complex,
long-term projects necessitate the creation of knowledge
bases that transcend the involvement spans of individual
personnel. NASA scientists and engineers have
particularly mentioned their need to record the expertise
of key individuals so that it will not be lost when they
retire or leave. Hypermedia-based associative memories
are ideal solutions to these problems since all of the rich
pragmatic content typical of STI can be included.
Hypermedia-based associative memories can recall
information even when queries are incomplete or garbled,
can store data in a distributed fashion, can detect
similarities between new inputs and previously stored
patterns, and do not degrade appreciably in performance if
some of the memory's components are damaged: all useful
characteristics for a distributed, shared organizational
knowledge base [WARR90]. Hypermedia associative
memories form the basis for the performance support
systems that are just beginning to be seen in Government
and Industry.
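
The text does not say which mechanism underlies such an associative
memory. As one possible illustration of recall from a garbled query
with distributed storage, the following Python sketch uses a small
Hopfield-style network; that choice, the function names, and the
stored patterns are assumptions made for the example and are not taken
from [WARR90].

```python
def train(patterns):
    """Build a Hebbian weight matrix from patterns of +1/-1 values."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:               # no self-connections
                    weights[i][j] += p[i] * p[j]
    return weights


def recall(weights, cue, steps=5):
    """Repeatedly update every unit from the weighted sum of the others."""
    state = list(cue)
    for _ in range(steps):
        new_state = []
        for i, row in enumerate(weights):
            total = sum(w * s for w, s in zip(row, state))
            # Keep the previous value on a tie so the state stays in {+1, -1}.
            new_state.append(1 if total > 0 else -1 if total < 0 else state[i])
        if new_state == state:
            break
        state = new_state
    return state


# Two hypothetical stored patterns; every weight carries a share of both.
stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
weights = train(stored)

garbled = [-1, 1, 1, 1, -1, -1, -1, -1]   # first stored pattern, one value flipped
restored = recall(weights, garbled)        # settles back on the first pattern
```

No single weight holds a whole pattern, which is the sense in which
storage is distributed; a hypermedia knowledge base would of course
attach far richer content to each stored item than a string of signs.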

The main value-added of hypermedia systems to the STI
community lies in the ability of hypermedia to handle the
full spectrum of STI's pragmatic content from data
manipulation to video display. The presentation of standard
textual information only just begins to take advantage of
hypermedia's strengths and suffers the disadvantages
associated with forcing users to read large amounts of text
from today's computer screens [NIEL90].

In summary, hyperprogramming is well suited to STI.
Advocates of hypermedia pose the following arguments for
why hypermedia constitutes a major advance over other
media:

o The associative, nonlinear nature of hypermedia
mirrors the structure of human long-term memory,
empowering both intelligence and coordination through
intercommunication.

o The capability of hypermedia to reveal and
conceal the complexity of content lessens the cognitive load
on users of this medium, thereby enhancing their ability to
assimilate and manipulate ideas.

o The structure of hypermedia facilitates capturing
and communicating knowledge, as opposed to mere data.

o Hypermedia's architecture enables distributed,
coordinated interaction, a vital component of teamwork,
organizational memory, and other 'group mind' phenomena
|WARR9f].

Although some would oppose the above claims, the unique
characteristics of STI and the STI community serve to make
this area particularly liable to benefit from greater use of
hypermedia technology.


What's  Happening  Now? 

There are a multitude of hypermedia developments going on
in the transition from traditional linear information
retrieval to active information viewing. This is happening
in what has been termed the multidimensional information
space |SEPEC9(H. We have seen from the above how this is
changing the way we store, retrieve, and use information.
Hyper-branching applications are being experimented
with throughout the whole of the government, academia,
and private industry. The following are but a few examples
representative of what is taking place.

The Experiment Documentation Information System (EDIS).
EDIS is being developed by Houston Applied Logic, Houston,
Texas, for the NASA Life Sciences Project Division at NASA
Johnson Space Center, Houston, Texas. It is a system
designed to produce and control the Life Sciences
Experiment Document (ED) containing large amounts of
text in combination with tables and graphs of mathematical
and scientific data, making use of hypertext concepts
through Macintosh HyperCard. The ED defines all
functional objectives, inflight equipment, consumables,
measurements, ground support, and test sessions, along
with the expected results of the experiments. The ED
consists of 16 chapters plus appendices. There is fixed,
or boilerplate, text in some sections that applies to any Life
Sciences experiment and reference table formats
concerning experiment-specific text and
mathematical/scientific data. Other sections contain
experiment data tailored for each experiment. The EDIS is
foreseen as being the first step in the automation of the
process required for defining complete packages of Life
Sciences experiments for the Shuttle missions [MOOR90].

Life Sciences Interactive Information Recall (LSIIR). This
is a study in hypermedia applications, being done by GE
Government Services, Houston, Texas, for the Life Sciences
Project Division, Johnson Space Center. LSIIR, through
interactive media technologies, provides online
information aids as a 'job performance assistance.' The
technologies are integrated into a computer desktop
workstation environment with which the mission or payload
specialist, the scientist, the engineer, and support or
administrative people are familiar. The LSIIR is foreseen
as providing assistance in Life Sciences Project missions
and activities such as development and testing, science
monitoring, technical lab activities, and mission testing.
The system uses Mac SEs running integrated
applications of HyperCard, MacRecorder Sound System,
MacDraw, MacPaint, Canvas, and MacroMind Director.
MacroMind Director enhances graphics display and
animation. Clip art and scanned photos are part of the
system's information base. The system serves as a
'trainer' or simulator; it provides the user with different
sets of information to change variables during an exercise,
or make alterations to procedures and configurations.
LSIIR has passed its proof-of-concept stage, and is
envisioned as an online system for electronic
documentation and information, and electronic training and
review in all areas of the NASA Life Sciences Project
activity [CHRIS90].

Decision Support System Shell. The Carroll School of
Management, Boston College, Chestnut Hill, Massachusetts,
is constructing a decision support system (DSS) on a
Macintosh that can support applications in a variety of
fields such as engineering, manufacturing, and finance.
The shell provides for a hypertext-style interface for
navigating among DSS application models, data, and
reports. They enhanced the traditional notion of manual,
predefined hypertext links by allowing for hypertext
connections to be built 'on the fly.' The term 'generalized
hypertext' is applied, in the sense of networking links
within a domain of multiple documents. Generalized
hypertext is a logic-based technique for automating
hypertext within a knowledge-based decision support
environment. Value is added by providing the hypertext-
style interface to the DSS application without an author
having to create any nodes or links, while at the same time
allowing for adding comments and other annotations. An
earlier version of the current development, called MAX,
was done for the U.S. Coast Guard [BIEB90].

Knowledge Base Browser (KBB). Currently under
development at the NASA Johnson Space Center is a
hypermedia system for browsing CLIPS knowledge bases.
CLIPS is the C Language Integrated Production System, an
expert system shell used in this case to create knowledge-
based expert systems of rules that control the processes of
the Onboard Navigation (ONAV) flight control position at
the Mission Control Center (MCC). These expert systems
will support the ascent, rendezvous, and deorbit/landing
phases of a Shuttle mission. The KBB, as a component
program of the MCC, serves to assist in the verification of
the rule bases of the various expert systems, and to
augment the training of the flight controllers. When
complete, the KBB will verify and browse the CLIPS rule
bases. This system, which in the view of its creators is a
hypermedia system, will include the capabilities of
automatic creation of links based on the CLIPS rule
structure, querying the rules and saving the results as a
collection, and browsing the rule bases either sequentially
or by using the links and collections [POCK90].
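
How links might be derived automatically from rule structure can be
sketched briefly. The Python fragment below is an illustration under
stated assumptions, not the KBB implementation: the `Rule` record is a
simplification rather than CLIPS syntax, the rule and fact names are
hypothetical, and the linking criterion (one rule asserts a fact that
another rule tests) is only one plausible reading of "links based on
the CLIPS rule structure."

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    matches: set = field(default_factory=set)   # fact names tested by the rule
    asserts: set = field(default_factory=set)   # fact names the rule can add


def build_links(rules):
    """Link rule A to rule B when A asserts a fact that B matches on."""
    links = []
    for a in rules:
        for b in rules:
            if a is not b and a.asserts & b.matches:
                links.append((a.name, b.name))
    return sorted(links)


def query(rules, fact):
    """Names of rules that mention a fact; the result can be kept as a collection."""
    return sorted(r.name for r in rules if fact in r.matches | r.asserts)


# Hypothetical rule base.
rules = [
    Rule("check-sensor", matches={"sensor-reading"}, asserts={"sensor-bad"}),
    Rule("flag-state-vector", matches={"sensor-bad"}, asserts={"vector-suspect"}),
    Rule("advise-controller", matches={"vector-suspect"}, asserts={"advisory"}),
]

links = build_links(rules)
collection = query(rules, "sensor-bad")
```

Browsing "by using the links" then amounts to walking the pairs in
`links`, while sequential browsing simply walks `rules` in order.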

The Space Station Freedom User Interface Language (SSF
UIL). SSF UIL is in development at the Space Operations
and Information Systems Division of the Laboratory for
Atmospheric and Space Physics, University of Colorado,
Boulder. It is designed for use by the astronauts, ground
controllers, scientific investigators, and
hardware/software engineers who will test and operate the
systems and payloads aboard the space station. The UIL is
object-oriented and English-like, supplements the graphical
user interface to systems and payloads by providing
command line entry, and will be used to write test and
operations procedures. Hypertext is used to provide links
between units of code (statements, steps, procedures,
etc.) and associated annotation and documentation, linking
code to object information, and linking steps within a
procedure [DAVI90].

Artificially Intelligent Graphical Entity Relation Modeler
(AiGerm). AiGerm is a relational database query and
programming language front end for the Germ (Graphical
Entity Relational Modeling) system. These systems are
being developed by Microelectronics and Computer
Technology Corporation, Software Technology Program
(MCC/STP), Austin, Texas. There are three versions of
AiGerm in use: Quintus Prolog, BIMprolog, and MCC's
Logical Data Language (LDL). AiGerm is intended as an
add-on component of the Germ system to be used for
navigating very large networks of information, harnessing
Prolog or LDL's relational database query capabilities. It
can also function as an expert system shell for prototyping
knowledge-based systems. AiGerm provides an interface
between the programming language and Germ. When a user
starts up AiGerm, the system builds a knowledge base of
the currently loaded Germ folio. The knowledge base is a
collection of node, link, and aggregate facts. The user
queries the database and runs programs that select, create,
delete, inspect, and aggregate the nodes and links appearing
in the Germ browser. To use AiGerm, the user first starts
up Germ and loads the desired hypertext network folio into
the Germ browser. In a knowledge base, for example, for
each hypertext entity - i.e., node, link, and aggregate -
AiGerm asserts a fact (a Prolog clause). AiGerm is
currently used in MCC/STP's DESIRE (DESign Information
REcovery) system to extract information on the design code
for software systems. Research staff are experimenting
with AiGerm in building IBIS (Issue Based Information
Systems), reasoning and decision support systems for
software design and engineering. Rockwell International,
an MCC/STP shareholder, uses AiGerm in a simultaneous
engineering project. MCC/STP states that users of AiGerm
can navigate Germ networks or develop prototypes of
knowledge-based hypermedia systems [HASH90].
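
The idea of a knowledge base made up of node, link, and aggregate
facts that can then be queried relationally can be illustrated in a
few lines. AiGerm itself asserts Prolog (or LDL) clauses; the Python
sketch below only mimics that arrangement with tuples, and the fact
layout, identifiers, and helper functions are hypothetical rather than
AiGerm's own.

```python
# One tuple per hypertext entity, mirroring the three kinds named in the text.
facts = [
    ("node", "n1", "Requirement"),
    ("node", "n2", "Design decision"),
    ("node", "n3", "Source module"),
    ("link", "l1", "n1", "n2"),          # link id, source node, target node
    ("link", "l2", "n2", "n3"),
    ("aggregate", "a1", ("n2", "n3")),   # aggregate id, member nodes
]


def match(pattern, fact):
    """A fact matches when lengths agree and every non-None field is equal."""
    return len(pattern) == len(fact) and all(
        p is None or p == f for p, f in zip(pattern, fact)
    )


def query(pattern):
    """Return all facts matching a pattern; None acts as an unbound variable."""
    return [f for f in facts if match(pattern, f)]


def reachable(start):
    """Node ids reachable from start by following link facts."""
    seen, frontier = set(), [start]
    while frontier:
        current = frontier.pop()
        for _, _, _, dst in query(("link", None, current, None)):
            if dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return sorted(seen)


links_from_n1 = query(("link", None, "n1", None))
downstream = reachable("n1")
```

In Prolog the same query would be a goal with unbound variables; the
`None` placeholders stand in for those variables here.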

PROJECT EMPEROR-I. This is a well-known hypermedia
project, merging microcomputer and videodisc hybrid
technologies. It has been ongoing since 1984. It is a major
research and development project which demonstrates how
new technologies enhance understanding and
appreciation of a subject, in this case Chinese humanities,
by delivering large-scale online (real-time)
hypermedia, multi-formatted, and multi-dimensional
information simply not possible in sequential-formatted
systems. The current hypermedia system includes an
interactive information delivery model for providing, at
rapid speeds measured in fractions of a second, requested
relevant information in any format - visual, audio, textual
- as selected by the viewers at their pace and choice,
including at the point of need. The project now includes:

o Two 12 inch NTSC CAV videodiscs, entitled 'The First
Emperor of China: Qin Shi Huang Di.'

o Interactive courseware, at both a lay public and a
serious researcher level. Prototype courses have been
developed for Digital Equipment Corporation's IVIS systems
and for IBM PC compatibles. Later systems now include
the Apple Macintosh Mac IIs.

o Electronic image databases for IBM compatibles and
Mac IIs. Further development efforts have taken place
with SOPHIATEC, Nice, France, and with the Project Athena
of Massachusetts Institute of Technology involving a
powerful multimedia image system using DEC's
proprietary MUSE software for high-end machines such as
DEC's MicroVax and IBM RTs. The EMPEROR-I hypermedia
system has also been looked at for use on Sun3s and Sun4s.

o High resolution imaging digitization and electronic
imaging has been performed on a Sun3-160 using OASIS
software.

o Converting and creating large textual files with
images and Chinese characters using MicroTek's MSF-3000
image scanner and INOVATIC's Readstar II Plus
optical character recognition software. Digital textual
files are kept on hard disks, but when the data
approaches 400-500 megabytes, a CD-ROM can be
produced.

This project, housed at Simmons College, Boston,
Massachusetts, and aided by many interested resources,
both industry and academic, is a masterful development.
Professor Chen's goal is to show that computer power,
storage technology, and software are now all available, at
affordable cost, to provide the opportunities for innovative
experimentation of ideas in education, training, research
and development in nearly every subject field (CHENSSj.

Intermedia. Brown University's Institute for Research in
Information and Scholarship (IRIS), Providence, Rhode
Island, has developed a powerful multi-user hypermedia
software that allows professors, students, and other
knowledge workers to create and follow links between
electronic documents of different types. This system is
named 'Intermedia.' This project defines hypermedia as
the dynamic linking of data such that related data is easily
accessible although the actual pieces of data may be stored
in different physical locations. In theory the data can be
any type, such as text, graphics, spreadsheets, video, or
audio. Intermedia provides a desktop environment similar
to that found on the Macintosh. The desktop contains
applications (or tools) such as a word processor, a
structured graphics editor, a historical timeline editor, a
scanned-image viewer, an animation editor, a videodisc
controller, and a viewer that displays and rotates three-
dimensional models. Users (now termed 'viewers' or
'authors'), with the tools just enumerated, enter data and
link significant items of information together for a
contextual viewing of that information.

Because of the extensive differences in the storage sources
of the information, the Intermedia development
incorporated two new concepts in the handling of the
information, the 'anchor' and the 'proxy.' The anchor
concerns maintaining consistency across the applications;
an anchor is a specific selection of data, a part of a
document, with the surrounding information used to
understand its significance. When a user follows a link,
the document window opens to the size and location on the
screen most recently saved, and automatically scrolls to
the section that reveals the anchor with its surrounding
information. The proxy is an intermediary concept used by
the viewer for selecting an anchor in disparate data
sources, e.g., text, graphics, sound. The use of the data
proxy concept allows the viewer to visualize non-
graphical and conceptual media, to have simplicity in
making links, and to extend system applications to related
data types (CATIM|.


NASA  STI  Program  Hypermedia  Applications 

This  section  analyzes  the  Clinical  Practice  Library  of 
Medicine  (CPLM)  system  developed  under  NASA  Grant 
NAQ10-004t  and  partially  funded  by  the  NASA  STI 
Program.  The  CPLM  was  conceived  in  1979  by  a  team  of 
medical  and  computer  experts  from  the  University  of 
Florida  and  Kennedy  Space  Center.  Since  its  onset,  the 
system  has  evolved  from  a  mainframe-based  text  database 
to  a  microcomputer-based  hypermedia  system  that 
supports  both  text  and  high-resolution  medical  images. 
The design changes necessary to expand the system to
include sound and animation are now being delineated.

The CPLM system is currently a computerized, rapid-
reacting, medical reference system that could be placed
aboard a long-term space flight to provide the spacecraft
physician with nearly instantaneous access to the most
complete medical references on Earth. With this type of
support system, the physician could be confident that he
was making the right diagnosis. The demonstration CPLM
system that is available now runs on an IBM PS/2 Model 80
microcomputer with a high resolution 8514/A Display and
a 1 gigabyte disk drive. The system is programmed in C
under Microsoft Windows. The system contains a variety
of medical texts including the STI Program's special
publication NASA SP-3006, the 'Bioastronautics Data
Book.' The CPLM system is written to allow expandability
to the full capacity of the available storage device. Both
traditional and hypermedia access to the information is
permitted. Traditional Boolean search methods are
enhanced by a parsing dictionary unique to each book that
holds current spellings and root word divisions, along with
a lexicon that provides a book-specific list of synonyms and
abbreviations and so automatically provides alternate search
terms to the user. Word and phrase linkage among all
documents is provided initially by the University of
Florida project team, with annotations to be eventually
added by the physician end users. The educational
capability of the CPLM system may be one of its major
benefits in addition to its ability to deliver complex
information in a user-friendly fashion.
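
The way a book-specific lexicon can widen a Boolean search is easy to
sketch. The Python fragment below is a minimal illustration and not
the CPLM code (which is written in C): the lexicon entries, the sample
passages, and the function names are all hypothetical, and matching is
done on whole words and phrases only.

```python
import re

# Hypothetical per-book lexicon: each term maps to its synonyms and abbreviations.
lexicon = {
    "myocardial infarction": {"heart attack", "mi"},
    "hypertension": {"high blood pressure"},
}

# Hypothetical passages keyed by an identifier.
passages = {
    "p1": "treatment of heart attack in confined environments",
    "p2": "high blood pressure and fluid shift in microgravity",
    "p3": "heart attack risk with high blood pressure",
}


def expand(term):
    """Return the term together with its alternates from the lexicon."""
    term = term.lower()
    alternates = {term} | lexicon.get(term, set())
    for key, synonyms in lexicon.items():
        if term in synonyms:
            alternates |= {key} | synonyms
    return alternates


def find(term):
    """Ids of passages containing the term or any alternate as whole words."""
    patterns = [re.compile(r"\b" + re.escape(alt) + r"\b") for alt in expand(term)]
    return {pid for pid, text in passages.items()
            if any(p.search(text.lower()) for p in patterns)}


def search_and(*terms):
    """Boolean AND: passages that satisfy every term."""
    return set.intersection(*(find(t) for t in terms))


def search_or(*terms):
    """Boolean OR: passages that satisfy at least one term."""
    return set.union(*(find(t) for t in terms))


both = search_and("myocardial infarction", "hypertension")
either = search_or("myocardial infarction", "hypertension")
```

Neither query term occurs literally in any passage; every hit comes
from an alternate supplied by the lexicon, which is the enhancement
over a plain Boolean search.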

Dr. Ralph Grams, University of Florida development team
leader, stresses that the planned addition of voice
activation, animation, and interactive hardware can make
the CPLM system function as a fully automated physician's
assistant. In a few years, Grams sees a miniaturized
hypermedia CPLM system built into space suits and carried
by Earthly physicians in their black bags [GRAM91].

NASA  STI  Program  (STIP)  Hypermedia  Plans 

The NASA STI Program has put together a project plan to
handle the development of a STIP Multimedia Initiative.
The Multimedia Initiative plan covers both hypermedia
applications fully controlled by the user and marketing
applications that present multiple forms of information
linked to each other with limited user control. The
authoring system platform to be procured is a Macintosh
lllx equipped with 750 megabytes of storage, a
videocassette recorder, a CD-ROM drive, a 35mm slide
scanner, and hardware and software to support graphics,
animation, sound, and real-time video capture and display.
One of the most significant hypermedia applications
planned within the STI Program is the NASA STIP
Demonstration Electronic Performance Support System.
This system will be developed at NASA STIP and will
provide the proof-of-concept necessary to demonstrate
how a performance support system can transcend the
involvement spans of individual personnel integral to STIP
projects. This demonstration hypermedia-based STIP
associative memory will be used to illustrate the
performance support concept to NASA scientists and
engineers. Later phases of the performance support
system will provide on-demand training in addition to a
project knowledge base. In revamping its services to
include multimedia, the NASA STIP will also develop
procedures to handle information processing requirements
such as cataloging of new media and appropriate report
documentation pages. In making a commitment to
multimedia and hypermedia development, the STI Program
acknowledges the application of hypermedia technologies
that has begun throughout the STI community, and plans to
foster the application of this technology to the removal of
barriers to the transfer of scientific and technical
information.
Abstract

The paper deals with the technology of automated
input from written text into databases.

The current emphasis in office automation and desktop
publishing helped optical character recognition to
become a useful and affordable method. Different
approaches (pixel-oriented, feature-based, using
dictionaries, using special algorithms for special problems¹,
character transition probabilities or grammars) form
the basis for meeting different classes of requirements
such as demanded by small fonts, large varieties of
fonts, special characters, different layout structures
with mixed text and graphics, difficult printed matter
with ligatures and kerning, and poor printing quality.

On the way to an automated (or semi-automated)
input process from printed matter into databases²,
OCR is nothing but a necessary step within the line
of scanning, character recognition, descriptive
cataloguing and (optional) content analysis. Cataloguing
here is to be understood in a broader sense covering
the identification and classification of the relevant
pieces of input and the normalisation process according
to database-specific rules.

The  role  of  efficient  input  into  bibliographic  databases 
is  discussed,  as  well  as  problems  and  techniques  of 
optical  character  recognition.  AUTOCAT,  a  software 
prototype  for  automated  cataloguing  is  introduced:  its 
foundation,  approach  and  user  interface. 

The project AUTOmatic CATaloguing (1985-1987)
was sponsored by the BMFT under contract no.
10100170


1  Information  Management 

1.1 Current Trends

The way information is produced, organized and distributed
has been changing rapidly since the personal computer
became a standard office tool. Word processing for text
generation, business graphics and desktop publishing for
information presentation, databases for handling factual
data and information retrieval systems for storing texts
are in widespread use. A current major trend towards
computer networks addresses the problem of communication
and data exchange, which nevertheless remains a problem.
National and international standards like Office Document
Architecture / Office Document Interchange Format³
(ODA/ODIF), Standard Generalized Markup Language⁴ (SGML)
and Electronic Data Interchange for Administration,
Commerce and Transport (EDIFACT⁵) are major efforts
towards unhampered information exchange.

At present, all these bright prospects do not prevent
paper from being the most important medium of information
interchange. Technology for transforming written text on
paper into machine-readable form is available and has been
applied more and more in recent years. Even Der Spiegel,
Germany's most popular political magazine, made a story
out of this new trend, not without emphasizing the
remaining problems⁶. Information retrieval systems tend to
offer an integrated OCR⁷ software package⁸. Effective
applications in the fields of office automation⁹ and media
documentation centers¹⁰ demonstrate the feasibility and
benefits of integrating OCR and text documentation.

¹ Broken or pasted-up characters, underlined words

² Data interchange standards like SGML or ODIF form the
alternative or supplementing strategy


³ see [8]

⁴ see [9]

⁵ EDIFACT is going to become an international standard.
In Germany, a draft is available as DIN 16556.

⁶ "Volles Rohr. Sogenannte Scanner, mit denen Gedrucktes
direkt von der Vorlage in den Computer eingelesen wird, sind
nun auch für den PC-Anwender erschwinglich." (In English:
"Full throttle. So-called scanners, which read printed matter
directly from the original into the computer, are now
affordable for the PC user as well.") Der Spiegel
n/911, pp. 240-243

⁷ Optical Character Recognition

⁸ To give some examples: AskSam and Optopus, MegaStore
and iBS GigaRead, Darwin and ReadStar

⁹ For example at the Bundesanstalt für Flugsicherung in
Frankfurt/M.

¹⁰ For example: at Gruner+Jahr, updating a database of
journal articles, or at BMW AG München, building a full text
information system on new technologies (see [15])

1.2 The Importance of the Input into
Bibliographic Data Bases

The user of bibliographic databases might not be aware
of the fact that in some respects it is not the retrieval
process but the input process that is the critical
operation¹¹:

• not only economically, since at least 70 % of all
costs of a documentation installation accrue here,

• not only quantitatively, since a database like
PHYS at the Fachinformationszentrum Karlsruhe
has an input of about 10,000 new documents
per month,

• but above all qualitatively, since topicality,
completeness and reliability of a database as well as
quality and stability of subject indexing are
decisive criteria for a potential client.

If the costs of a database are high compared to the
income, if these costs mainly originate from the input,
and if a lower level of quality and quantity makes the
value of the database questionable, then powerful means
have to be found to increase effectiveness in this area.
Clearly successful results have been achieved by
cooperation and data exchange with producers of primary
information and with other information suppliers, by
exploiting favourable market prices, e.g. for the services
of typing offices, and, last but not least, by an
efficient organisation and by appropriate software with
powerful functions (see [14]). In this paper, however, we
shall focus on the argument that even better results can
be achieved more effectively on the basis of OCR and
expert system technology.

1.3  From  paper  to  database  records:  a 
multistage-transition 

Although hardware and software for automated input
into databases are available in principle, inexperienced
users seem to underestimate this process. The Der Spiegel
report, for example, describes some of the typical
problems and ends with the statement

"... even more effort than for the scanning
process itself is necessary to prepare the texts
(13,000 pages) for a database"¹².

¹¹ This section is mainly taken from [13], also dealing with
automated input into bibliographic data bases but concentrating
on aspects of automatic (subject) indexing.

¹² "Noch weit mehr Aufwand als das Scannen, so lernte der
Strafverteidiger Kirch, würde es kosten, die Texte des
Aktenkonvoluts für eine Datenbasis aufzuarbeiten. Wenn der PC
jedesmal die ganzen 13.000 Seiten hätte durchsuchen müssen,
wäre das kein großer Fortschritt gewesen."


A stepwise transformation process from written material
into a machine-readable database format, well
prepared for retrieval, can be described as follows
(see fig. 1):

Graphical level. A scanner is an input device that maps
the printed page onto a matrix of picture elements
(pixels). This matrix is stored as a file which can
be manipulated by operations like those offered by
drawing programs. The data structure supports its
presentation as a facsimile of the original document
on a high-resolution screen. It cannot be used for
retrieval without additional data.

Basically, a scanner can be specified by its resolution
(dpi = dots per inch, i.e. pixels per unit length) and
the data type of its pixels: boolean (black/white), a
range of grey intensities or of colours.

Other features concern the size and the transport
of the paper, and the scanner's ability to cope with
different surface qualities of paper.
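To give a feeling for the data volume produced at the graphical level, the following minimal sketch computes the dimensions and the uncompressed size of the pixel matrix for one scanned page. The page format (A4), the resolutions and the function name `pixel_matrix` are illustrative assumptions and are not taken from the AUTOCAT project.

```python
import math

def pixel_matrix(width_in, height_in, dpi, bits_per_pixel=1):
    """Return (columns, rows, bytes) of the raster for one scanned page.

    bits_per_pixel: 1 for black/white, 8 for 256 grey levels,
    24 for RGB colour (assumed, typical values).
    """
    cols = math.ceil(width_in * dpi)
    rows = math.ceil(height_in * dpi)
    size_bytes = math.ceil(cols * rows * bits_per_pixel / 8)
    return cols, rows, size_bytes

# An A4 page measures 210 mm x 297 mm; 1 inch = 25.4 mm.
A4 = (210 / 25.4, 297 / 25.4)

for dpi in (200, 300):
    cols, rows, size = pixel_matrix(*A4, dpi)
    print(f"{dpi} dpi: {cols} x {rows} pixels, "
          f"{size / 1024:.0f} KiB uncompressed")
```

Because the resolution applies in both directions, raising the dpi value from 200 to 300 multiplies the number of pixels by about 2.25, not by 1.5.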

Character level. Optical character recognition
software identifies groups of pixels as representing
single characters (see section 2).

The result is stored as an ASCII file (or can be
transformed into formats used by word processing
or spreadsheet software). It can be used for
(sequential) searching, as text editors do. But such
a search is unable to differentiate between Information
Management as a phrase in the title and the journal
Information Management which might be the
source of the document.
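The limitation of the character level can be shown in a few lines. The sample text, the field names and both function names below are invented for illustration only; they do not reproduce any real record or any particular retrieval system.

```python
# Flat OCR output: one character string without any structure.
ocr_text = (
    "Trends in Information Management\n"
    "A. Author\n"
    "Information Management 12 (1990), pp. 1-10\n"
)

def sequential_search(text, phrase):
    """Return the (0-based) numbers of all lines containing the phrase."""
    return [i for i, line in enumerate(text.splitlines()) if phrase in line]

# The same document after descriptive cataloguing
# (hypothetical category names).
record = {
    "title": "Trends in Information Management",
    "author": "A. Author",
    "source": "Information Management 12 (1990), pp. 1-10",
}

def field_search(rec, field, phrase):
    """Search for a phrase within one named category only."""
    return phrase in rec.get(field, "")

# The sequential search cannot tell the title from the source line.
hits = sequential_search(ocr_text, "Information Management")
in_title = field_search(record, "title", "Information Management")
in_author = field_search(record, "author", "Information Management")
```

Only the structured record allows a query such as "phrase in the title" to be answered, which is the motivation for the next level.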

Formal document structure level. In general,
bibliographic databases are based on a category
scheme for documents. Relational databases use
even more elaborate schemata. Descriptive
cataloguing maps the input data into a representation
fitting the predefined scheme. It is a well
understood task within the framework of professional
input of literature. Here it is to be understood in
an even broader sense, covering the identification
and classification of the relevant pieces of input
(for example, identifying "Gerhard E. Knorz" as
the author of the paper) and the normalisation
process according to database-specific rules (such
as transforming "Gerhard E. Knorz" to "Knorz,
G.E.").
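The normalisation example given above can be written down directly. The following is a minimal sketch under strong assumptions: the last token is taken to be the surname, and name prefixes ("von", "de"), corporate authors and the actual INIS rules are not handled. The function name `normalise_author` is hypothetical.

```python
def normalise_author(name):
    """Turn 'Forename [Initial] Surname' into 'Surname, F.I.'.

    Simplified: the last token is taken as the surname.
    """
    parts = name.split()
    if len(parts) < 2:
        return name.strip()
    surname = parts[-1]
    initials = "".join(p[0].upper() + "." for p in parts[:-1])
    return f"{surname}, {initials}"

print(normalise_author("Gerhard E. Knorz"))  # Knorz, G.E.
```

The hard part of the task is not this transformation but the preceding identification step: the function presupposes that the string has already been recognised as an author name.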

Content level. Up to this point, a document in a database
consists of a set of categories, each of which is
composed of a sequence of words. Every single
word is nothing but a character string between
delimiters. Basically, the standard search technique
combines elementary search patterns by
operators of Boolean logic.

Manual content analysis aims to support the
search for relevant documents by adding
descriptors (often taken from a controlled
vocabulary like a thesaurus) or classification codes.
Automated indexing may simulate this approach
(see [6]) or may support retrieval by means of
other statistical or linguistic techniques.

Figure 1: Four steps from written material into a data base
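The standard search technique mentioned under the content level, combining elementary search patterns by Boolean operators, can be sketched with an inverted index. The three sample records and the helper names `build_index` and `term` are invented for illustration; real retrieval systems add truncation, field restriction and proximity operators.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map every word (lower-cased) to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index

docs = {  # invented sample records
    1: "optical character recognition for cataloguing",
    2: "expert systems for cataloguing",
    3: "optical storage media",
}
index = build_index(docs)

def term(word):
    """Elementary search pattern: all documents containing the word."""
    return index.get(word.lower(), set())

# Boolean operators map directly onto set operations.
and_hits = term("optical") & term("cataloguing")   # AND
or_hits = term("optical") | term("expert")         # OR
not_hits = term("cataloguing") - term("optical")   # AND NOT
```

A descriptor added by manual or automated indexing would simply be one more entry in such an index, which is why the fourth step can be treated independently of the OCR step.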


This decomposition of an automated input process
into single steps reflects the underlying conceptual
structure of the task. Depending on the degree of
software integration, the user may not perceive the
first or even the second step as an independent step.
Normally, the third step is performed very poorly:
the results of OCR processing are taken to be
the content of one single data field (category), and
the document record is supplemented manually with
additional data. Section 3 describes an experimental
but well-elaborated solution. In [11], p. 59, a new
software product is announced which seems to realise a
very similar approach.

For specialised and restricted applications dedicated
solutions make sense: e.g. the OCR software GigaRead,
developed especially for reading address books,
identifies the different parts of an address item and
forms a structured database entry as a result.

The fourth step can be discussed quite independently
of the application of OCR techniques. We do
not look into this aspect here (further reading: [12],
[13] or [6]).


2 OCR - problems and techniques

2.1  Types  and  sources  of  potential 
problems 

Some typical problems will be summarized¹³ briefly.
Documents often make use of a variety of fonts and
sizes. Especially small (6pt and smaller) and large
fonts (24pt and larger) may cause problems. The
same applies to special fonts¹⁴ (e.g. italics), special
characters, a large number of different fonts within
one document, very small distances between characters,
kerning, the use of ligatures and of underlining.

Other types of problems arise from (poor) printing
quality: characters may be broken into pieces, causing
"m" to look like "rn", for example. On the other hand,
two characters may merge into one. Small pieces of
characters may be omitted (like the upper part of
the "i") or may be present as some kind of noise only.
Handwritten additions may interfere with the original
lines of text, as may background noise and patterns.

The more creative the layout, the more difficult is
the correct interpretation. We can expect graphics
and text to be separated automatically (or under manual
control). Columns of text have to be recognized and
read in a meaningful sequence.

¹³ The program committee recommended concentrating on
the aspect of descriptive cataloguing. Accordingly, other topics
had to be shortened.

¹⁴ A particular type of problem with some fonts is the
distinction between "I", "l" and "1".

For  further  processing  the  information  on  the  original 
fonts  and  sizes  may  be  preserved  or  not,  depending 
on  the  system  used  and  the  needs  of  the  application. 

2.2  Approaches  and  techniques  to 
character  recognition 

A variety of strategies try to cope with the
requirements of different application types.

Pattern matching algorithms

basically compare the pixel image of a character
with standard images from a training set of
characters. These algorithms have the advantage of
being fast and of being trainable to arbitrary fonts
and character sets. The disadvantage is that they are
not independent of character size and font, so
normally a training phase must precede the reading
phase.
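The principle of pattern matching can be sketched as a nearest-template search. The toy font below (5 x 3 bitmaps), the Hamming distance as similarity measure and the function names are assumptions made for illustration; they do not describe the KDEM workstation or any other commercial product.

```python
def hamming(a, b):
    """Number of differing pixels between two equally sized bitmaps."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))

def classify(glyph, templates):
    """Return the label of the template closest to the glyph."""
    return min(templates, key=lambda label: hamming(glyph, templates[label]))

# Training set: one 5x3 bitmap per character (invented toy font).
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}

# A scanned "T" with one noisy pixel in the fourth row.
noisy_t = ["111", "010", "010", "110", "010"]
print(classify(noisy_t, TEMPLATES))  # T
```

The sketch also shows the disadvantage named above: the comparison only works for glyphs of exactly the template size, so every new font or size needs its own training set.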

Feature recognition based algorithms

try to identify a character by means of detected
features¹⁵ (loops, crossings, lines). They depend on
fixed character sets, but are quite independent of
size and font. Normally, training is not necessary
and sometimes not even possible.
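As an illustration of a single feature, the following sketch counts the loops (enclosed background regions) of a binary glyph by flood filling. The glyph encoding ('#' for ink, '.' for background) and the function name `count_loops` are assumptions; a real recogniser would combine many such features.

```python
def count_loops(glyph):
    """Count enclosed background regions ('holes') in a binary glyph.

    glyph: list of equal-length strings, '#' = ink, '.' = background.
    """
    rows, cols = len(glyph), len(glyph[0])
    # Pad with background so that all outside background is connected.
    grid = [["."] * (cols + 2)]
    grid += [["."] + list(row) + ["."] for row in glyph]
    grid += [["."] * (cols + 2)]

    def flood(r, c):
        stack = [(r, c)]
        while stack:
            y, x = stack.pop()
            if 0 <= y < rows + 2 and 0 <= x < cols + 2 and grid[y][x] == ".":
                grid[y][x] = "x"  # mark as visited
                stack.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])

    flood(0, 0)  # remove the outer background
    loops = 0
    for y in range(rows + 2):
        for x in range(cols + 2):
            if grid[y][x] == ".":  # unvisited background = new hole
                loops += 1
                flood(y, x)
    return loops

SMALL_O = ["###", "#.#", "###"]
LARGE_O = ["####", "#..#", "#..#", "####"]
LETTER_B = ["##.", "#.#", "##.", "#.#", "##."]
LETTER_L = ["#..", "#..", "###"]
```

The small and the large "O" yield the same feature value, which illustrates why feature-based methods are largely independent of character size.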

Training of unknown characters or of known characters
with unfamiliar shapes may be necessary,
possible or impossible. A dictionary may be used
for (additional) automated training¹⁶.

A  conflict  set  of  characters,  as  a  special  kind  of 
training  set,  might  support  the  discrimination  of 
similar  characters. 

Context analysis can be applied to introduce arguments
of the type: the letter "O" in the context
of numbers most likely turns out to be a "0".
Another method is to apply a pattern matching
filter a posteriori, detecting typical recognition
errors in the produced text file.
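The context argument quoted above ("O" among digits is most likely "0") can be sketched as an a posteriori filter over the recognised text. The confusion table and the majority threshold are assumptions chosen for this example, not values taken from any OCR product.

```python
import re

# Letters that are easily confused with digits (assumed confusion table).
LETTER_TO_DIGIT = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})

def fix_digits(text):
    """Repair confusable letters in tokens that consist mainly of digits."""
    def repair(match):
        token = match.group(0)
        digits = sum(ch.isdigit() for ch in token)
        # Only touch tokens in which at least half of the characters are digits.
        if digits and 2 * digits >= len(token):
            return token.translate(LETTER_TO_DIGIT)
        return token
    # Whole tokens made up of digits and confusable letters only.
    return re.sub(r"\b[0-9OolI]+\b", repair, text)

print(fix_digits("pp. 24O-243, 19O5"))  # pp. 240-243, 1905
```

Ordinary words such as "Oslo" or the pronoun "I" are left untouched because they contain no digit at all; the price of such a cautious rule is that a token like "lO" (for "10") is not repaired.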

The system AUTOCAT, described in the next section,
made use of the OCR workstation KDEM 1.200.
This machine can be trained for different fonts and
sizes, and it preserves this kind of information together
with the recognized character itself for further
processing. KDEM 1.200 was tested successfully in
the course of the AUTOCAT project, and it was the
only appropriate OCR system available at that time
(1985). In this respect, the situation has changed
completely.

¹⁵ Some producers try to establish the phrase Intelligent
Character Recognition (ICR).

¹⁶ The characters of recognised words found in the
dictionary are chosen as a training set. So the system adapts
to the specific features of the copy on hand. This increases
reliability and speed.


3  An  Expert  system  approach 
to  automated  cataloguing 

The result of formal cataloguing is a formally structured
document description, derived from the primary
publication¹⁷. In order to perform the tasks in the
fields of library and documentation and, last but not
least, to allow data exchange between different
information facilities, extensive sets of rules such as
the Anglo-American Cataloguing Rules (AACR 2, 620 pages)
were established. The present high degree
of explicit standardization, and the fact that this is
a non-trivial, labour-intensive task including a great
share of routine situations, suggest the development
of a knowledge-based system for formal cataloguing.
This idea was taken up at different places in the last
decade (see [3]). Simpler systems only support the
standardization of data entries. If a knowledge-based
system is supposed to cover the whole process of
formal cataloguing [10], [17], [18], then the interest must
focus on the first step, namely the identification and
categorization of the different elements of the input.

AUTOCAT¹⁸ (AUTOmated CATaloguing), to be
described in more detail, denotes both a project and a
system (see [?]). It certainly was one of the largest
efforts in the field and made a serious attempt at
practical application.

• It adheres to the relatively simple cataloguing
rules of the International Nuclear Information
System INIS.

• It relies on knowledge about document types, in
particular on an empirical information structure
of physics journals.

• The prototype of a cataloguer's workplace has
been developed, providing a user interface that
allows the user to control the system and to
correct its output.

AUTOCAT was developed in two phases: The first
phase at the Technical University of Darmstadt ended
with a Prolog prototype that showed the feasibility of
the relatively simple cataloguing of articles from
physics journals. In its second phase, a software house
took over the project. During this time AUTOCAT
made some steps towards a marketable software
product. This prototype was presented at the Hannover
Fair '89.


¹⁷ A much more detailed publication on this section's
subject is going to be completed ([5]). The following material is
adapted from this publication.

¹⁸ The project AUTOmatic CATaloguing (1985-1987) was
sponsored by the BMFT under contract no. 10100170.


3.1  The  AUTOCAT  approach 

AUTOCAT produces records for a bibliographic
database, and its prototypical application environment
is INIS. In contrast to cataloguing assistants,
AUTOCAT comprises the whole process of descriptive
cataloguing.

AUTOCAT was first developed for cataloguing
articles from physics journals. Although its concept has
been extended to report and monograph cataloguing,
the AUTOCAT approach is still best explained using
the core application for scientific journals.

AUTOCAT catalogues periodical articles in two main
steps:

1. It recognizes information elements in the
machine-readable journal in as much detail as is
necessary for cataloguing under INIS cataloguing
rules.

2. It normalizes the information elements as
stipulated by INIS rules and enters them into the
target record, defined by the categories of the INIS
worksheet.

AUTOCAT starts its work with a first representation
of the journal article, generated as output of an OCR
processing step. Simple layout features like fonts and
distances between tokens are used to detect input blocks
that are candidates for functional roles like title,
author etc. (see fig. 2). The targets of recognition are
defined by the information structure of the journal
being processed. Network grammars are bound to
that target structure (see fig. 3) and interpret the
input representation with the help of

• the known information structure of articles in the
journal, augmented with the relevant items of the
journal itself,

• lists of keywords for special categories like
affiliations.

The categories of the INIS worksheet form the final
target structure, to which normalization rules are
bound. They take their input from the stored
intermediate results and transform it in adherence to INIS
rules.
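The detection of candidate blocks from simple layout features, which precedes the grammar-based interpretation described above, might look roughly as follows. This is a sketch under stated assumptions: the `Token` fields, the font labels and the gap threshold of 6 points are hypothetical and do not reproduce the data delivered by the OCR workstation used in the project.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    font: str    # font/size label delivered by the OCR step (assumed)
    gap: float   # vertical distance to the preceding token, in points

def split_blocks(tokens, max_gap=6.0):
    """Start a new block whenever the font changes or the gap is large."""
    blocks = []
    for tok in tokens:
        same_font = bool(blocks) and tok.font == blocks[-1][-1].font
        if same_font and tok.gap <= max_gap:
            blocks[-1].append(tok)
        else:
            blocks.append([tok])
    return blocks

# Invented header of an article: title, author, affiliation.
tokens = [
    Token("Ultra-thin", "bold-14", 0.0),
    Token("films", "bold-14", 0.0),
    Token("A.", "italic-10", 12.0),
    Token("Author", "italic-10", 0.0),
    Token("Institute", "roman-8", 4.0),
    Token("of", "roman-8", 0.0),
    Token("Physics", "roman-8", 0.0),
]
blocks = split_blocks(tokens)
block_texts = [" ".join(t.text for t in block) for block in blocks]
```

The resulting blocks are only candidates: which block is the title and which the author is decided afterwards by the grammars bound to the journal's information structure, and normalization is applied as a separate second step.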

3.2  The  AUTOCAT  representation  of 
physics  journals 

Since cataloguing rules like AACR2 tell a cataloguer
what to look for, they must define abstract expectations
about the journal or any other document type.
So, implicitly, they define a normalized informational
model of the document type. On the basis of such an
abstract document model, one can characterize and
recognize an occurring document as a realization of an
abstract information object. The AUTOCAT
representation of journals' information structures expands
these ideas. It uses categories taken from cataloguing
rules like INIS or AACR2 (e.g. title proper, author,
affiliation), supplemented only as far as necessary
by other constructs that cater for observed data (e.g.
head of article or page frame). The representation
developed by AUTOCAT is the result of an empirical
investigation of 330 core journals of physics. In
a recent copy of every journal, the title page, the table
of contents and the first and the last page of at least
one paper were examined for data about the occurring
information elements, their sequence, and structure
cues like separators, fonts and spaces. An additional
test using complete volumes of 40 journals supplied
reassuring results: the information structure of
physics journals is reasonably stable (see [2]).

Some simple observations help to specify the technical
features of a representation device for information
objects as it is needed here:

Since information objects are mainly defined by their
composition, the part-whole relation will be the
backbone of the representation. Generic relations
appear less important.

"Natural" objects in a reader's eyes, like an article or
a table of contents, may be very large in terms of
knowledge representation constructs (such as frames). A
facility for the adequate representation of "complex
objects" must be provided.

An information object may have multiple instances:
A paper may have, for instance, several authors,
but not several titles. So the description of
individual elements has to include a cardinality
restriction.

Information objects may occur repeatedly for the
reader's orientation, and thus provide useful redundancy
for the recognition process. It must be possible to
state equivalence relations of this type between
information objects.

As recognition means finding an instance of an
information object in the actual document copy, it is
reasonable to associate recognition methods with
appropriate information objects.

Scientific journals differ in their presentation without
being heterogeneous: They combine the available
design solutions in an individual way. The necessary
individualization of periodicals can be achieved by
typing their components. The individual journal
is described by the set of types of its parts. It must be
possible to store this type of information and include
it in the abstract representation of the information
structure before processing.
Figure 2: Header of an article with marked blocks

Figure 3: A simplified network grammar (discussion in the subsection on grammars)

For the implementation of a representation that copes
with these requirements, a frame language can be
used, embedded into a knowledge engineering
environment.
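The requirements listed above (part-whole relation, cardinality restriction, equivalence relations, attached recognition methods) can be mimicked with a small frame-like structure. This is only an illustrative sketch: the class `InfoObject`, its slot names and the toy recognition method are assumptions and do not reproduce the frame language actually used for AUTOCAT.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class InfoObject:
    name: str
    parts: list = field(default_factory=list)          # part-whole relation
    min_count: int = 1                                  # cardinality restriction
    max_count: Optional[int] = 1                        # None = unbounded
    equivalent_to: list = field(default_factory=list)   # redundant occurrences
    recognise: Optional[Callable[[str], bool]] = None   # attached method

    def cardinality_ok(self, n):
        """Check a number of found instances against the restriction."""
        upper_ok = self.max_count is None or n <= self.max_count
        return n >= self.min_count and upper_ok

def looks_like_name(block):
    """Toy recognition method: every word starts with a capital letter."""
    words = block.split()
    return bool(words) and all(w[0].isupper() for w in words)

title = InfoObject("title")                        # exactly one title
author = InfoObject("author", max_count=None,      # one or more authors
                    recognise=looks_like_name)
running_title = InfoObject("running title", min_count=0,
                           equivalent_to=["title"])
article_head = InfoObject("head of article", parts=[title, author])
```

Typing a journal's components, as described in the last observation, would then amount to choosing for each journal which concrete `InfoObject` descriptions its parts use.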

3.3 Selected aspects of the AUTOCAT prototype

3.3.1 Grammars for analyzing information
elements

AUTOCAT used hierarchies of ATN-like transition
grammars (see fig. 3). The nodes and links are written
in a user-oriented Prolog notation which subsequently
is translated into a Prolog module. A grammar takes
two types of parameters: a specification of the input
(e.g. an author name grammar must not cross a block
border¹⁹) and a selection of a particular alternative
(sub-grammar for a particular object type). The
results are transferred via special output registers.
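The idea of a transition network with output registers can be sketched in a few lines. The original grammars were written in a Prolog notation; the Python version below is a deliberately simplified, deterministic stand-in with invented token classes and state names, and it lacks the recursion and backtracking of a full ATN.

```python
import re

def token_class(tok):
    """Very coarse token classes (assumed for this sketch)."""
    if re.fullmatch(r"[A-Z]\.", tok):
        return "INITIAL"
    if re.fullmatch(r"[A-Z][a-z]+", tok):
        return "NAME"
    if tok in {",", "and"}:
        return "SEP"
    return "OTHER"

# state -> {token class -> next state}
NETWORK = {
    "start":   {"NAME": "given", "INITIAL": "given"},
    "given":   {"INITIAL": "given", "NAME": "surname"},
    "surname": {"SEP": "start"},
}
FINAL = {"surname"}

def parse_authors(tokens):
    """Parse the tokens of ONE block (the grammar never crosses a block border).

    Returns a list of (given names, surname) pairs, the 'output register',
    or None if the block is not accepted as an author block.
    """
    state, given, authors = "start", [], []
    for tok in tokens:
        nxt = NETWORK[state].get(token_class(tok))
        if nxt is None:
            return None
        if nxt == "given":
            given.append(tok)
        elif nxt == "surname":
            authors.append((" ".join(given), tok))
            given = []
        state = nxt
    return authors if state in FINAL else None

print(parse_authors(["Gerhard", "E.", "Knorz"]))
```

Because the network is deterministic, a name with two written-out forenames is rejected; handling such alternatives is exactly where sub-grammars and backtracking become necessary.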

We can distinguish two types of structures
for which grammars have to be defined. It is necessary

• to specify the structure of document types, e.g.
page layout, block structure (macro structure),

• to specify the syntax of different information
objects, e.g. author block, affiliation block (micro
structure).

[16] describes the design of a GEM-based²⁰
implementation of a graphical interface (see fig. 4), which
can be used to compile an actual macro structure of a
document type from a set of single components. The
user models a document type as a sequence of pages.
(S)he selects primitive or composed information
items like title, abstract, author and so on and places
them on the page. Each object (page, title, etc.) is
represented by a named box. This way, the presentation
of the part-whole relation is quite natural. The
user is able to formulate alternatives and may define
a hierarchy of substructures. The final schema
representation of a real document page is translated into
Prolog facts.

From our experience with grammars we do not feel
that they form a reliable tool for use by the
end user: Complex grammars, as are necessary for
analyzing the micro structures of this application, cannot
be modified without extensive testing. It is not
possible to decide whether a rule is obsolete, wrong, or
correct just by looking at a single rule which seems
to cause problems, without considering the context of
the whole grammar. In [7] another way of defining
syntactical structures is proposed and implemented.
It starts from the observation that a user typically
modifies a grammar in a situation where the system
cannot cope with a particular example. Why
should this very example not be used to teach the
system? So the grammar, as the internal representation
of syntactical structures, is incrementally constructed
from a set of examples. The user types in the example,
and the system breaks down the input into a sequence
of tokens.

¹⁹ Blocks and their borders are defined in a preprocessing phase
by means of layout features.

²⁰ Graphics Environment Manager. A trademark of Digital
Research Inc.

The user has to decide for which token in the example
which level of abstraction is appropriate. Afterwards
(s)he has to teach the system the constituent
structure of the examples (see fig. 5). The system has
to transform the information given by the user into
grammar rules and has to ensure that no conflicts
with existing rules arise. The implemented version
of the system translates each example into one
corresponding (complex) grammar rule. This reference to
an actual example should tell the user much more about
the reason for an underlying problem than an abstract
grammar rule.
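The procedure described above (one example, a user-chosen level of abstraction per token, one resulting rule, a conflict check) can be sketched as follows. All names, the three token classes and the conflict criterion are assumptions made for this illustration; they are not taken from the implementation described in [7], and the constituent structure is left out.

```python
import re

def token_class(tok):
    """Abstraction levels available for a token (assumed)."""
    if tok.isdigit():
        return "NUMBER"
    if re.fullmatch(r"[A-Z][a-z]+", tok):
        return "CAPWORD"
    return "LITERAL"

def rule_from_example(tokens, abstract, category):
    """Build one rule from one example.

    abstract: positions the user wants generalised to a token class;
              all other tokens are kept literally.
    """
    pattern = tuple(token_class(t) if i in abstract else t
                    for i, t in enumerate(tokens))
    return {"pattern": pattern, "category": category,
            "example": list(tokens)}

def matches(rule, tokens):
    """True if every token equals the pattern element or belongs to its class."""
    return len(tokens) == len(rule["pattern"]) and all(
        p == t or p == token_class(t)
        for p, t in zip(rule["pattern"], tokens))

def add_rule(rules, new_rule):
    """Refuse a rule whose example is already claimed for another category."""
    for rule in rules:
        if (matches(rule, new_rule["example"])
                and rule["category"] != new_rule["category"]):
            raise ValueError(
                f"conflict with the rule learnt from {rule['example']}")
    rules.append(new_rule)

rules = []
volume_rule = rule_from_example(
    ["Vol", ".", "12", "(", "1990", ")"], abstract={2, 4}, category="volume")
add_rule(rules, volume_rule)
```

Since every rule keeps a reference to the example it was learnt from, a conflict can be reported to the user in terms of that concrete example instead of an abstract rule.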

3.3.2  The  user  interface  of  the  Knowledge 
Craft  prototype 

The shift from journal cataloguing to report and
monograph cataloguing caused some substantial changes
in the architecture of the AUTOCAT system. A
basically frame-oriented expert system development
tool, Knowledge Craft, is used for the implementation.
Knowledge Craft offers backward chaining
(CRL-PROLOG) and forward chaining rules (CRL-OPS).
In the following, we shall focus on the user
interface.

The interface handles the interaction between the
user (cataloguer) and the AUTOCAT system. There
are several functions which should be provided by the
user interface. The main task is to check and correct
the cataloguing results. To check the results, a
document facsimile is reconstructed on the basis of the
stored layout information, and a representation of the
corresponding INIS worksheet entries is displayed on
the screen. To correct the results, the user must be
able to edit the worksheet entries.

The initial window on the screen contains a list of
all newly processed documents which have not yet been
handled by the cataloguer. The documents are
classified into recognized and unrecognized documents.
'Unrecognized' is a label for documents with
obligatory worksheet entries missing²¹. When the user
clicks on a document identification with the mouse,
AUTOCAT selects the document. For interaction, several
command buttons are offered (e.g. edit-worksheet,
change-level, next-page).

²¹ It is possible that AUTOCAT was not able to find all the
necessary information in the input.

The user can edit the worksheet or invoke an
intelligent copy function by activating a worksheet tag
and subsequently marking a passage of the document
facsimile²².

The selected passage is transformed according to INIS
rules and copied into the active entry.
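The classification into recognized and unrecognized documents and the intelligent copy function can be sketched together. The set of obligatory tags and the two normalisation functions below are placeholders invented for this example; the real INIS worksheet and its rules are considerably richer.

```python
def normalise_title(passage):
    """Collapse line breaks and repeated blanks."""
    return " ".join(passage.split())

def normalise_author(passage):
    """'Forename [Initial] Surname' -> 'Surname, F.I.' (simplified)."""
    parts = passage.split()
    if len(parts) < 2:
        return passage.strip()
    return f"{parts[-1]}, " + "".join(p[0] + "." for p in parts[:-1])

# One normalisation function per worksheet tag (hypothetical tags).
NORMALISERS = {"title": normalise_title, "author": normalise_author}
OBLIGATORY = ("title", "author")

def document_status(worksheet):
    """'unrecognized' if an obligatory worksheet entry is missing."""
    missing = [tag for tag in OBLIGATORY if not worksheet.get(tag)]
    return "unrecognized" if missing else "recognized"

def intelligent_copy(worksheet, active_tag, marked_passage):
    """Normalise the marked passage and copy it into the active entry."""
    worksheet[active_tag] = NORMALISERS[active_tag](marked_passage)
    return worksheet

worksheet = {"title": "Ultra-thin films"}
before = document_status(worksheet)
intelligent_copy(worksheet, "author", "Gerhard E. Knorz")
after = document_status(worksheet)
```

The point of the design is that the cataloguer only marks text in the facsimile; which normalisation applies is determined by the active worksheet tag.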


4  Conclusions 

Both the Prolog and the Knowledge Craft prototype
contribute different aspects on the way to an
automatic cataloguing system suitable for practical
application. It took about 3 years for a university research
group to develop the foundations and the first Prolog
prototype. It showed the feasibility of the relatively
simple cataloguing of articles from physics journals
on a broad empirical basis.

It took about half a year to design and implement
the Knowledge Craft prototype. Its purpose was
twofold: It demonstrated that knowledge-based
automatic cataloguing can serve as a helpful tool for
cataloguers if it provides an appropriate user interface.
Furthermore, the implementation showed in an
exploratory way that the modified approach to recognition
supports more difficult cataloguing tasks.

More generally speaking, the AUTOCAT project
demonstrated the soundness of knowledge-based
concepts in the field of the information market. It
showed that OCR and expert system technology may
very well help to solve practical problems of the
information industry.

The potential applications of the AUTOCAT
technology are not limited to the formal cataloguing of
the document types described in this article. The
techniques developed during the AUTOCAT project are
also applicable to other document types, e.g. telex,
electronic mail, technical documentation.

Quite naturally, there are still limitations. Whether
or not current OCR technology can meet specific
application demands has to be established by careful
pretests, but the chances are good. On the
other hand, AUTOCAT's document type is quite
static (compared with the flexibility supported by SGML
and ODA). Nevertheless, it should be clear that the
demands of routine tasks can be fulfilled on the basis
of available methods and technologies.


Acknowledgements 

This contribution is based upon the results of different
projects sponsored by the BMFT (German Department
of Research and Technology), in which many
colleagues were engaged.

²² The inclusion of a snapshot of the interface had to be
canceled because of poor image quality.

As indicated, the section on AUTOCAT is a shortened
and modified version of parts of a paper which
I am currently preparing together with several
co-authors: Brigitte Endres-Niggemeyer, Ulrike Rauth
and Heinz Marburger.

I  want  to  express  my  sincere  thanks  to  all  persons 
involved. 


References 

[1] Appelt, W., "Normen im Bereich der
Dokumentverarbeitung", Informatik-Spektrum
(1989)12: pp. 321-30

[2] Below, A. v., "Funktionsgerechte
Strukturierung von Fachzeitschriften", Nachr. Dok.
38(1987), pp. 343-349

[3] Davies, R. and James, B., "Towards an Expert
System for Cataloguing: some Experiments
Based on AACR 2", Program 18 (1984)
4: pp. 283-297

[4] Endres-Niggemeyer, B. and Knorz, G.,
"AUTOCAT: Wissensbasierte Formalkatalogisierung
von Fachzeitschriften", in: Brauer, W.;
Wahlster, W. (eds.): "Wissensbasierte
Systeme", 2nd Int. GI-Kongress 1987. Berlin:
Springer, 1987, pp. 53-62

[5] Endres-Niggemeyer, B.; Knorz, G.; Marburger,
H.; Rauth, U., "AUTOCAT - a Knowledge-Based
Approach for Descriptive Cataloguing".
To appear.

[6] Fuhr, N.; Hartmann, S.; Lustig, G.; Schwantner,
M.; Tzeras, K.; Knorz, G., "AIR/X -
a Rule-Based Multistage Indexing System for
Large Subject Fields", in: Research and
Development in Information Retrieval (Proc.), 1991

[7] Gotz, P., "'Learning by Examples' als
Strategie zur Wissensakquisition. Entwurf und
Implementierung einer Benutzeroberfläche zur
Beschreibung syntaktischer Strukturen",
Diploma Thesis, TH Darmstadt, Dep. of
Computer Science, 1989

[8] "Information Processing - Office Document
Architecture (ODA) and Interchange Format",
ISO 8613, International Organization for
Standardization, Geneva, 1989

[9] "Information Processing - Standard Generalized
Markup Language (SGML)", ISO 8879,
International Organization for Standardization,
Geneva, 1989


7-l(P 


[10]  Jeng,  L.-H.,  "An  Expert  System  for  Determi¬ 
ning  Title  Proper  in  Descriptive  Cataloguing: 
A  Conceptual  Model”,  Cataloguing  ic  Classi¬ 
fication  Quarterly,  7(1986)3:  pp.  55-70 

[11]  KI.  Kiinstliche  Intelligenx:  Forscfaung,  Ent- 
wicklung,  Erfahrungen  Organ  der  Fachbe- 
reichs  1  der  Gesellscbaft  fur  Informatik  e.V. 
(GI).  (1990)4 

[12]  Knorz,  G.,  "Automatic  Cataloguing  and  In¬ 
dexing”,  In:  Bamev,  P.;  Kerpe<yiev,  S  (eds.), 
"Proc.  of  Programming  '90”,  Sofia,  Bulgaria, 
1990. 

[13]  Lustig,  G.  (ed.),  "Automatische  Indexierung 
zwischen  Forschung  und  Anwendung”,  01ms, 
Hildesheim,  1986. 

[14]  Marek,  D.,  ”Zwei  Jahre  Online-Input  im  Fa- 
chinformationszentrum  Energie,  Physik,  Ma- 
thematik”,  ABI-Technik  3  (1983):  pp..  201- 
208 

[15]  "Volltext-Datenbank  per  Scanner  erstellen”, 
PC-Magazin,  17(1991),  apr.  1991 

[16]  Pitz,  H.,  "Beschreibung  der  Informations- 
struktur  von  Zeitschriflen  fiir  ein  wissensba- 
siertes  System  zur  Formalkatalogisierung.  Ent- 
wurf  und  Implementierung  einer  Benutzer- 
oberflache".  Diploma  Thesis.  TH  Darmstadt, 
Dep.  of  Computer  Science,  1987 

[17]  Schirra,  J.  R.;  Brack,  U.;  Wahlster,  W.;  Woll, 
W.,  ”WILIE  -  Ein  wissensbasiertes  Litera- 
turerfassungssystem” ,  in  Endres-Niggemeyer, 
B.;  Krause,  J.  (eds.):  "Sprachverarbeitung 
in  Information  und  Dokumentation” ,  Berlin: 
Springer  1985,  pp.  101-112 

[18]  Weibel,  S.;  Oskins,  M.;Vizine-Goetz,  D.,  "Au¬ 
tomated  Title  Page  Cataloguing:  A  Feasibi¬ 
lity  Study” ,  Information  Processing  &  Mana¬ 
gement,  25  (1989)  2:  pp.  187-204 




Data  Compression  Techniques 

R.A.  Hogendoorn 
National  Aerospace  Laboratory 
P.O.  Box  153 
8300  AD  Emmeloord 
The  Netherlands 


1  Summary 

Data compression can be used to reduce the volume of
documents. This results in considerable savings in storage
capacity and in transmission time. In general, there
are two classes of compression techniques: reversible
compression or statistical compression, with which documents
can be reproduced exactly, and non-reversible or
noisy compression, with which documents can be reproduced
up to a given fidelity. Non-reversible compression
most often gives a higher compression factor but, obviously,
at a price. Reversible algorithms are suitable
for text compression and black-and-white images.
Non-reversible algorithms are suitable for images. This paper
describes the advantages and caveats of data compression:
what is to be expected if data compression is used
and, more importantly, what is not to be expected.

2  Introduction 

Data compression is gradually becoming accepted as a
means to make better use of available resources like
communication channels and disk storage. For example,
facsimile would not have been quite as useful as it is now
if data compression had not been used. Facsimile uses
an algorithm called the modified READ (Relative Element
Address Designate) code, which compresses a digitally
scanned page by a factor of 13 on average. This
means that the transmission time is only about 1 minute
instead of 13 minutes.

Another example, in which data compression is indispensable,
is High-Definition Television (HDTV). The
raw data rate coming from an HDTV camera is about
470 Mbit/s, whereas the television broadcast channel will
only allow a data rate of about 1.75 Mbit/s. In order to
obtain this large reduction of the data rate, quite elaborate
data compression techniques have to be used.

Now, what is data compression? Essentially, data compression
is a matter of modelling: one tries to model
the mechanism that generated the data, i.e. the source.
Hence, data compression is also called "source coding".
Of course, the better the model is, the less information
is needed to describe the data, i.e. the output generated
by the source. Note that it is assumed that the model
is known at all times. The fact that data compression is
modelling explains why there are so many different data
compression algorithms. Each algorithm has its own
peculiarities, which make it more, or less, suited for a
particular application.


Data compression algorithms can be divided into two
classes, namely reversible and non-reversible algorithms.
Reversible algorithms only change the representation of
the data into a more efficient one. The data resulting
from decompression are identical to the original data as
generated by the source. Non-reversible algorithms, on
the other hand, make only an approximate representation
of the original data. The result after decompression
differs from the original data by a certain amount, called
the distortion. For each non-reversible algorithm, there
is a trade-off between the amount of distortion and the
compression factor. A larger compression factor results
in a larger distortion.

One of the fundamental questions of data compression is:
what is the shortest possible representation of the output
of a source? To answer this question, the concept
"information" needs to be defined more precisely. Basically,
the amount of information that is conveyed by a symbol
depends upon the predictability of the symbol, i.e. its
probability of occurrence. The less probable the symbol,
the more information is conveyed by it. Using this notion,
the amount of information generated by a source
can be defined:

$$H_0(x) = \sum_{i=0}^{N-1} -\Pr(x = i) \log_2 \Pr(x = i) \tag{1}$$

$H_0(x)$ is called the "entropy" of the source. The entropy
is maximum for a source that outputs equiprobable symbols,
i.e.

$$\Pr(x = i) = 1/N, \quad i \in \{0, \ldots, N-1\}$$

An example of a source is the throwing of a die. The
output symbol is the number of spots that shows up. For
this source, the entropy

$$H_0(x) = 2.59 \ \text{bits/symbol}.$$

If the die is loaded, say, the probability of throwing a 1
is 1/2, then

$$H_0(x) = 2.16 \ \text{bits/symbol}.$$
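
The two die entropies can be checked numerically. The following is a minimal sketch of equation (1) in Python; it assumes that the five remaining faces of the loaded die are equally likely (1/10 each), which the text does not state explicitly. The function name `entropy` is chosen here for illustration.

```python
import math

def entropy(probabilities):
    """Entropy in bits/symbol of a memoryless source, following equation (1)."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

fair_die = [1 / 6] * 6
# Assumption: a 1 comes up half the time and the other five faces
# share the remaining probability equally.
loaded_die = [1 / 2] + [1 / 10] * 5

print(f"fair die:   {entropy(fair_die):.3f} bits/symbol")
print(f"loaded die: {entropy(loaded_die):.3f} bits/symbol")
```

Under that assumption the fair die gives log2(6), about 2.585 bits/symbol, and the loaded die about 2.161 bits/symbol, in line with the rounded figures quoted above.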

In equation (1), it is assumed that subsequent output
symbols of the source are independent of each other.
Normally, this is not the case. For example, the entropy of
the English language is estimated to be

$$H_0(x) \approx [\text{value illegible}] \ \text{bits/letter}$$

if letters are considered independent of each other. However,
if the inter-letter dependencies, like the fact that a
"q" is almost always followed by a "u", are taken into
account, the entropy is estimated to be

$$H_\infty(x) \approx 0.6 \ldots 1.3 \ \text{bits/letter}.$$
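
The effect of inter-letter dependencies can be illustrated on a small scale. The sketch below estimates the entropy of a text sample twice: once with letters treated as independent, and once with each letter predicted from the letter before it. The function names and the sample text are made up for illustration, and estimates from such a short sample say nothing reliable about English as a whole; the point is only that the second figure comes out lower.

```python
import math
from collections import Counter

def order0_entropy(text):
    """Bits/letter when letters are treated as independent of each other."""
    counts = Counter(text)
    total = len(text)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def order1_entropy(text):
    """Bits/letter when each letter is predicted from the one before it."""
    pairs = Counter(zip(text, text[1:]))   # counts of (previous, current)
    firsts = Counter(text[:-1])            # counts of the previous letter
    total = len(text) - 1
    return sum(-(c / total) * math.log2(c / firsts[a])
               for (a, _b), c in pairs.items())

# Made-up sample in which "q" is always followed by "u".
sample = "the quick queen quietly quit the quiz and the quest " * 20
print(order0_entropy(sample), order1_entropy(sample))
```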

Up till now, only discrete sources, i.e. sources that can
output symbols from a finite alphabet, have been considered.
There are also sources that output real values, e.g.
the reading of a current meter. In such a case, the accuracy
of the reading has to be taken into account to define
the equivalent of an entropy. This results in a curve,
the so-called "rate-distortion curve". The rate-distortion
curve specifies, for each distortion, the entropy of the
source. Note that no particular distortion measure is
specified in advance; it depends entirely upon the application
that makes use of the decompressed data.

Why is the entropy such an important concept? Well,
a main result in source coding theory is C. Shannon's
Source Coding Theorem. Essentially, this theorem states
that it is not possible to find error-free codes with an
average length below the entropy of the source. More
specifically, let sequences of L source letters be coded in N
bits and let only one source sequence correspond to each
code sequence. Let Pr(error) be the probability of occurrence
of a source sequence for which there is no code
sequence. Then, for any $\delta > 0$, if

$$N/L \geq H_0(x) + \delta \tag{2}$$

Pr(error) can be made arbitrarily small by making L
sufficiently large. Conversely, if

$$N/L \leq H_0(x) - \delta \tag{3}$$

then Pr(error) must become arbitrarily close to 1 as L is
made sufficiently large.

A similar result holds for source coding with a fidelity
criterion. It is not possible to find codes with a
rate-distortion curve that lies, at any point, below the
rate-distortion curve of the source. Note that the
rate-distortion curve not only depends on the source, but also
on the distortion measure. This means that for the same
source, several different rate-distortion curves are possible,
depending on the distortion measure that is used.

3  Reversible  Data  Compression 

Reversible data compression algorithms change the
representation of the information such that the average length
of the representation is as short as possible. This is
done by using information about the probabilities of
occurrence of source symbols. A simple example of such
an algorithm is run-length coding. A run-length encoder
replaces series of consecutive identical symbols by
the symbol followed by a repeat count. Especially with
black-and-white images, this simple algorithm gives a fair
amount of compression.
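
Run-length coding as described above can be sketched in a few lines of Python. The function names and the sample scan line are illustrative assumptions; practical coders, such as the facsimile code mentioned in the introduction, additionally encode the run lengths with variable-length codewords.

```python
from itertools import groupby

def rle_encode(symbols):
    """Replace each run of identical symbols by (symbol, repeat count)."""
    return [(symbol, len(list(run))) for symbol, run in groupby(symbols)]

def rle_decode(pairs):
    """Expand (symbol, repeat count) pairs back into the original sequence."""
    return [symbol for symbol, count in pairs for _ in range(count)]

# One made-up scan line of a black-and-white image: 0 = white, 1 = black.
scan_line = [0] * 12 + [1] * 3 + [0] * 9
encoded = rle_encode(scan_line)
print(encoded)
```

Here 24 pixels are represented by three (symbol, count) pairs, and decoding restores the scan line exactly.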

A more elaborate coding technique is Huffman coding.
A Huffman encoder uses a fixed table of codewords. The
codewords are chosen such that the symbol with the highest
probability of occurrence gets the shortest codeword,
the symbol with the second highest probability gets the
second shortest codeword and so on. Huffman codes are
good codes: let $\bar{\ell}_L(x)$ be the average length of the
Huffman code that codes L source symbols at a time, then

$$H_L(x) \leq \bar{\ell}_L(x) < H_L(x) + 1/L \tag{4}$$

In other words, Huffman codes result in an average codeword
length that is arbitrarily close to the theoretical
minimum by taking L large. However, a large L results
in such a large codeword table that it is impossible to
implement such a code. Also, it is assumed that the
probabilities of occurrence of symbols do not change. In
practice, this is not always true. Huffman codes cannot
efficiently adapt to these changing probabilities. In a
worst case situation, even data expansion may occur. For
example, if the digital facsimile compression algorithm is
used for halftone (grey-level) images, data expansion occurs,
since the codeword tables were designed for binary
images. Although Huffman coding has its limitations, it
is an important coding technique that is often used as a
part of a more complex compression algorithm.
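
As an illustration, the sketch below builds a Huffman code table for the loaded die of Section 2, coding one symbol at a time (L = 1). It uses the usual construction of repeatedly merging the two least probable entries; the function name and the assumption that the five remaining faces each have probability 1/10 are illustrative, not taken from the paper.

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a code table {symbol: bit string} from {symbol: probability}."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probabilities.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {s: "0" for s in probabilities}
    while len(heap) > 1:
        # Merge the two least probable subtrees, prefixing their codewords.
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

loaded_die = {1: 0.5, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1}
table = huffman_code(loaded_die)
average_length = sum(p * len(table[s]) for s, p in loaded_die.items())
print(table, average_length)
```

The most probable face receives a one-bit codeword, and the average codeword length works out to 2.2 bits/symbol, which lies between the entropy of about 2.16 bits/symbol and that entropy plus one bit, as the bound for L = 1 requires.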

A coding technique that overcomes the limitations of
Huffman coding is arithmetic coding. It is more complex
than Huffman coding, but it is also more efficient and it
can adapt to changing statistics quite easily. Arithmetic
coding is used in the newer data compression algorithms,
such as the JPEG (Joint Photographic Experts Group)
standard algorithm for still images.

An entirely different approach is dictionary coding. The
idea is to have a dictionary of strings (series of symbols)
of possibly different lengths and, instead of the string,
the index of the string in the dictionary is sent. The
Ziv-Lempel algorithm is an example of this class of
algorithms. The Ziv-Lempel algorithm builds its dictionary
adaptively, based upon the past input symbols. There are
many variations upon the basic Ziv-Lempel algorithm;
they all give a good compression and can be easily decoded.
Therefore, this type of algorithm is popular for
compression of files on personal computers and workstations.
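
The adaptive dictionary idea can be sketched as follows. This is a minimal version of one member of the Ziv-Lempel family (the variant usually called LZ78), chosen here for brevity; the paper does not say which variant it has in mind, and the function names are illustrative. Each output pair holds the index of the longest string already in the dictionary and the symbol that follows it.

```python
def lz78_encode(text):
    """Encode text as (dictionary index, next symbol) pairs; index 0 is ''."""
    dictionary = {"": 0}
    output = []
    phrase = ""
    for symbol in text:
        if phrase + symbol in dictionary:
            phrase += symbol              # keep extending the known string
        else:
            output.append((dictionary[phrase], symbol))
            dictionary[phrase + symbol] = len(dictionary)
            phrase = ""
    if phrase:                            # input ended inside a known string
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

def lz78_decode(pairs):
    """Rebuild the text; the decoder grows the same dictionary as the encoder."""
    phrases = [""]
    pieces = []
    for index, symbol in pairs:
        phrase = phrases[index] + symbol
        phrases.append(phrase)
        pieces.append(phrase)
    return "".join(pieces)

text = "ab" * 8
pairs = lz78_encode(text)
print(len(text), "symbols ->", len(pairs), "pairs")
```

The decoder needs no copy of the dictionary: it reconstructs it from the pairs it has already received, which is why decoding is simple.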


4  Non-Reversible  Data  Compression 

Non-reversible data compression algorithms make only
an approximate representation of the data. They introduce
a distortion with respect to the original data that
depends on the compression factor. The larger the
compression factor, the larger the distortion. Non-reversible
compression algorithms are also called "noisy" or "lossy"
compression algorithms. The advantage of non-reversible
compression algorithms is that much larger compression
factors can be obtained with them than
with reversible algorithms. Furthermore, there are
also practical reasons for using non-reversible algorithms.
Reversible algorithms assign codewords depending on the
probabilities of occurrence of source symbol sequences. If
N is the number of different source symbols and L the
length of the sequence of source symbols that are encoded
in one codeword, then the size of the codeword table is
$N^L$. Obviously, the codeword table size is far too large
for anything but small values of N and, especially, L.
Now, given that N = 256 for most images (8-bit pixels),
it is clear that the coding of images is almost exclusively
done with non-reversible algorithms.

Figure 1: Image pixels used in prediction

The following paragraphs describe the most important
non-reversible compression techniques, assuming that the
data being compressed is image data, i.e. a picture
consisting of $L \times L$ pixels.

One of the simplest techniques is predictive coding.
The idea is to use the values of neighbouring pixels to
get an estimate of the value of the current pixel. Instead
of using the current value of the pixel, the difference
between the estimate and the current value is used. If the
picture does not change a lot from pixel to pixel, the
differences can be encoded far more efficiently than the
original pixel values, e.g. by using a suitable Huffman code.
Predictive coding is not inherently non-reversible, since
the differences can be coded exactly. However, by
disregarding all differences below a certain value, the
compression factor can increase a lot without a significant
loss in picture quality. Figure 1 shows the pixels used in
a simple predictor. The estimate $\hat{x}$ is

$$\hat{x} = [\ldots] + [\ldots]$$

The picture is encoded row by row and only the very
first pixel value is needed to start the decoding process,
so this value is retained with the compressed differences.
Predictive coding gives a fair compression, while the
implementation complexity is low.
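
A minimal sketch of predictive coding on a single row of pixels is given below. Because the predictor of Figure 1 is not recoverable here, the sketch assumes the simplest possible estimate: the previously reconstructed pixel in the same row. The function names, the threshold rule (differences whose magnitude does not exceed the threshold are sent as zero) and the pixel values are all illustrative assumptions.

```python
def dpcm_encode(row, threshold=0):
    """Encode a row as its first pixel plus a list of prediction differences."""
    first = row[0]
    differences = []
    previous = first                  # value the decoder will have rebuilt
    for pixel in row[1:]:
        difference = pixel - previous
        if abs(difference) <= threshold:
            difference = 0            # small differences are disregarded
        differences.append(difference)
        previous += difference        # follow the decoder, not the original
    return first, differences

def dpcm_decode(first, differences):
    """Rebuild the row by accumulating the differences."""
    row = [first]
    for difference in differences:
        row.append(row[-1] + difference)
    return row

row = [100, 101, 101, 103, 110, 111, 111, 109]   # made-up pixel values
first, exact = dpcm_encode(row)                  # threshold 0: reversible
_, coarse = dpcm_encode(row, threshold=2)        # threshold 2: non-reversible
print(exact)
print(coarse)
```

With a threshold of zero the row is restored exactly. With a threshold of 2 most differences become zero, which a Huffman code can represent very cheaply, and because the encoder predicts from the reconstructed values, no pixel in this sketch is off by more than the threshold.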

A more elaborate technique is transform coding. By using
a mathematical technique, called a "transform", the
picture is split into components that describe the
variations in the picture, ranging from a component that
specifies the average pixel value up to a component that
specifies the fastest variations, that is, pixel-to-pixel
variations. Transformation requires a fair amount of processing
time. Therefore, it is done on small parts of the picture,
e.g. blocks of 8 × 8 pixels. Also, special transforms
are taken, like the Discrete Cosine Transform (DCT),
because they are computationally more efficient than the
optimum transform.

The reason to look at the variations in pixel values, rather
than the pixel values themselves, is that these variations are
most often very gradual. Therefore, a few transform components
are sufficient to give an accurate reconstruction
of the block of pixels. The distortion measure used with
transform coding algorithms is the Root-Mean-Square
(RMS) error. This error can easily be expressed in terms
of the dropped transform components. So, it is quite easy
to make an adaptive algorithm with a guaranteed maximum
RMS error. Transform coding gives a very good
compression, but is expensive in terms of computations
and more elaborate to implement. The JPEG standard
algorithm for images is a transform coding algorithm.
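
The relation between the RMS error and the dropped components can be demonstrated with a small sketch. It applies a one-dimensional, orthonormally scaled DCT to a single row of eight made-up pixel values, keeps the first three components and reconstructs the row. A real coder would transform 8 × 8 blocks in two dimensions and quantise the components rather than simply dropping them; the function names and the choice of three retained components are assumptions made for illustration.

```python
import math

def _scale(k, n):
    # Scaling that makes the transform orthonormal.
    return math.sqrt((1.0 if k == 0 else 2.0) / n)

def dct(block):
    """Orthonormal DCT (type II) of a list of samples."""
    n = len(block)
    return [_scale(k, n) * sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, v in enumerate(block))
            for k in range(n)]

def idct(coeffs):
    """Inverse of dct() above."""
    n = len(coeffs)
    return [sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
                for k, c in enumerate(coeffs))
            for i in range(n)]

block = [52, 55, 61, 66, 70, 61, 64, 73]      # made-up pixel values
coeffs = dct(block)
keep = 3                                      # components retained
approx = idct(coeffs[:keep] + [0.0] * (len(coeffs) - keep))

rms_direct = math.sqrt(sum((a - b) ** 2 for a, b in zip(block, approx)) / len(block))
rms_from_dropped = math.sqrt(sum(c * c for c in coeffs[keep:]) / len(block))
print(rms_direct, rms_from_dropped)
```

Because the transform is orthonormal, the two RMS figures agree: the error can be computed from the dropped components alone, without reconstructing the block, which is what makes an error-controlled adaptive coder straightforward.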

Vector quantization (VQ) is a compression technique
that, in theory, can perform quite close to the
rate-distortion bound. The image is divided into vectors; for
example, an $n = l \times m$ block of pixel values can be taken as an
n-dimensional vector. Each vector is then compared with
a collection of representative templates or codevectors.
The best match is chosen and its index in the codebook
becomes the codeword. Essentially, VQ is a dictionary
method as described in the previous section.

Compression is obtained in VQ by using a codebook with
relatively few codevectors compared to the number of
possible vectors. The complexity of VQ depends on the
size of the codebook and the way in which comparisons
with the codevectors are made. Compression is very good
and decoding is very fast. Therefore, it is often used for
coding video images. A disadvantage is that the generation
of a good codebook requires a lot of processing.
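
The encoding and decoding steps of VQ can be sketched as follows. The codebook of four 2 × 2 blocks is a hypothetical one written by hand for illustration; in practice the codebook is generated from training images, which is the expensive step mentioned above. The function names and the use of squared error as the matching criterion are likewise assumptions.

```python
def squared_error(a, b):
    """Sum of squared differences between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def vq_encode(vectors, codebook):
    """Replace each vector by the index of its closest codevector."""
    return [min(range(len(codebook)),
                key=lambda i: squared_error(vector, codebook[i]))
            for vector in vectors]

def vq_decode(indices, codebook):
    """Decoding is a table look-up, which is why it is fast."""
    return [codebook[i] for i in indices]

# Hypothetical codebook: 2 x 2 blocks flattened row by row into 4-vectors.
codebook = [
    (0, 0, 0, 0),            # dark block
    (255, 255, 255, 255),    # bright block
    (0, 0, 255, 255),        # horizontal edge
    (0, 255, 0, 255),        # vertical edge
]
blocks = [(3, 0, 5, 2), (250, 255, 251, 248), (10, 240, 4, 251)]
indices = vq_encode(blocks, codebook)
print(indices)
```

Each block of four 8-bit pixels (32 bits) is replaced by a 2-bit index in this toy example; the price is that the decoder returns the codevector, not the original block.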

5  Applications 

At this moment, data compression techniques for text
compression are readily available on most workstations
and personal computers. A large share of the algorithms
are in the public domain and can be used freely. Most
of them are based on the Ziv-Lempel algorithm and give
a fair compression (factors between 2 and 5). Several of
these algorithms have become de facto standards, like
compress and lharc. For other types of information,
like images, speech and video, only "canned" and
application-specific solutions are available. This situation,
however, will change within the next few years. A
lot of standardisation has been and is being done. JPEG
specified a standard algorithm for still images. The Motion
Picture Experts Group (MPEG) prepared a draft
standard algorithm for video. It will only take a couple
of years before these standard solutions are available on
a wide basis and for an affordable price. The development
of HDTV will also result in more and better data
compression techniques. All these techniques, however,
will be variations upon the basic algorithms described
in this paper. The evolution of micro-electronics makes
it possible to use the more advanced and computationally
intensive compression algorithms that give far better
compression than the present algorithms.

SUMMARY

The special problems of materials
property databases, as opposed to
scientific numeric databases, are
described. A review of progress
towards materials property databases
since the seminal workshop of 1982 in
Fairfield Glade, Tennessee is given,
based on the recent Third
International Symposium on the
Computerization and Use of Materials
Property Data. Topics include
standards and data representation,
standards and database development,
expert systems and materials
databases, data issues for engineering
materials, industrial applications,
and working and prototype systems.

1. INTRODUCTION

In the course of daily life, few
people give much thought to the
engineering effort invested in the
structures and devices which they use
and trust. Flying at 10 000 metres in
a modern aircraft, turning on a new
electrical appliance, swinging a
high-tech tennis racquet at a ball: all
these are acts of faith in an
integrated design and manufacturing
system whose output can be trusted to
be safe and reliable. That, no doubt,
is as it should be; those concerns are
delegated to experts whose task it is
to take both natural and man-made
materials and fashion them into
servants of mankind. To optimize
their use of those materials, however,
the designers and engineers need to
know all about them. In other words,
they need property data for
engineering materials.

Until the computer came of age, these
experts either performed tests to
obtain their own data or sought those
data from a variety of printed or
human sources. Given the expense,
slowness and inconsistent quality of
data obtained this way, it is not
surprising that the computer was seen
as a powerful component to be added to
the information loop. Out of this
grew the concept of "computerised"
property data.

Significant efforts to deal with such
data began about ten years ago and
progress was amply illustrated at the
recent Third International Symposium
on the Computerisation and Use of
Materials Property Data'. Three days
of presentations, coupled with
demonstrations of working systems,
showed that while much has been
accomplished, much remains to be done.

The purpose of this paper is to sketch
the current scene, primarily through a
synopsis of the Symposium, but first
it is necessary to outline some
background material.

2. BACKGROUND

2.1.  Definitions 

An ordered collection of computerized
property data for engineering
materials is known as a "database".
Databases may conveniently be divided
into two broad categories: reference
or source. "Reference" includes those
databases that contain bibliographic
citations or references to other
information sources; "source" includes
those databases that contain numeric,
textual-numeric, full-text or image
information^. In searching,
therefore, a "hit" in a reference
database points to a place where the
desired information may be found,
whereas a "hit" in a source database
contains the desired information item
itself.

Thus, an engineering materials
property database (MPD) is a source
database. More specifically, it is an
ordered collection of data items whose
values (1) correspond to various
large-scale properties, parameters or
attributes of practical materials and
(2) are critically evaluated or
validated by experts prior to their
being included in the database.

Clearly, there are similarities
between MPD and scientific numeric
databases (SND)^ but there are also
significant differences. Whereas SND
tend to deal with the fundamental,
microscopic properties of elements and
compounds, MPD tend to deal with the
extensive, macroscopic properties of
natural or fabricated substances which
can depend upon the manufacturing
process, geometry, use history and
many other factors. This aspect of
MPD will be amplified in Section 3.

2.2. History

It is generally felt that the history
of computerised materials property
data dates from an international
workshop convened at Fairfield Glade,
Tennessee in 1982’. This seminal
meeting was followed by two others in
Petten (1984’, 1988*) and one in
Schluchsee' (1985). As the field
matured, international symposia were
held in 1987* and 1989, with the most
recent held in September 1991 as
mentioned above. In addition to these
major gatherings, several
discipline-oriented seminars were held in
1984-1986 to explore problems related to
particular industries, the one
pertaining to the aerospace industry
being typical'.

Progress was not limited to meetings,
of course, and practical online
prototype database systems, the
Materials Property Data Network (USA)
and the Demonstrator Programme
(Commission of European Communities),
began in 1984 and 1986 respectively.

It is fortunate that, from the outset
of this activity, the need for
international standards has been
acknowledged. In this way cooperation
has been fostered, duplication of
effort avoided and the need to strike
compromises, to satisfy strong
factions with pre-existing vested
interests, circumvented. Groups like
the Technical Working Area 10 of the
Versailles Project on Advanced
Materials and Standards (VAMAS) and
the Committee on Data for Science and
Technology (CODATA) of the
International Council of Scientific
Unions foresaw the need for standards
and pressed vigorously for them to be
developed. Responding to this
challenge, the American Society for
Testing and Materials (ASTM) formed
Committee E-49 in 1986. Formally
designated the Committee on the
Computerization of Materials Property
Data, E-49 was given the task of
developing standard classifications,
guides, practices and terminology for
building and accessing materials
property databases. Only through such
internationally accepted consensus
standards can the quality and
reliability of MPD be maintained and
compatibility among different
applications be assured in order to
reduce costs and promote exchange.

3. PROBLEMS UNIQUE TO MPD

To understand why MPD have required
such a vast effort and why the number of
practical databases is still
relatively small, it is necessary to
grasp why these data are more
difficult to deal with than
fundamental property data.

Fundamental properties, like for
example the specific heat of pure
aluminum or the crystal structure of
salt, are intrinsic properties of
the substance. Hence they tend to be
constant in time, independent of the
manufacturing process and independent
of the direction or technique by which
they are measured. Furthermore,
everyone agrees that the symbols "Al" and
"NaCl" uniquely identify the
substances in question, and anyone can
reproduce those substances, and the
measurement if need be, in their
laboratory with no additional
information. The corresponding entry
in a database of fundamental
properties, in principle, therefore
need be little more than a numerical
value assigned to a field name.

Engineering properties, like for
example the creep strength of steel or
the fracture toughness of an aluminum
alloy, are not intrinsic properties of
those substances. They may change as
the material is loaded and as it ages,
they will certainly depend on the heat
treatment used in manufacture, and the
numerical values assigned to those
parameters really have meaning only if
cited in conjunction with the test
method used to determine them.
Furthermore, an engineer in one country
may know a particular alloy of steel
as "Type ABC" whereas a colleague in
another jurisdiction might know it as
"Type XYZ". Thus, to be meaningful,
entries for these properties in a
database must be more than rudimentary
numerical values; they must also
include data about the data, otherwise
known as "metadata", which may be
defined as a set of data descriptors
and other associated information that
characterize the individual data
values.
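
What such an entry might look like can be sketched as a small record type. The following Python fragment is purely illustrative: the field names, the method `is_known_as` and every value are hypothetical placeholders, not taken from any standard or real database, and the set of fields is far from the minimum set a real MPD would require.

```python
from dataclasses import dataclass, field

@dataclass
class PropertyRecord:
    """A property value plus metadata needed to interpret it (illustrative)."""
    designation: str          # material name as used by the data source
    property_name: str
    value: float
    unit: str
    test_method: str          # the value has meaning only with its test method
    heat_treatment: str
    synonyms: list = field(default_factory=list)  # designations used elsewhere

    def is_known_as(self, name):
        """True if `name` is the designation or one of its recorded synonyms."""
        return name == self.designation or name in self.synonyms

# All values below are invented placeholders, not measured data.
record = PropertyRecord(
    designation="Type ABC",
    property_name="creep strength",
    value=100.0,
    unit="MPa",
    test_method="hypothetical test method T-1",
    heat_treatment="hypothetical treatment H-1",
    synonyms=["Type XYZ"],
)
print(record.is_known_as("Type XYZ"))
```

Even this toy record carries several descriptive fields for one number, which hints at why an engineering entry needs far more than the value and field name that suffice for a fundamental property.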

Clearly, these examples are simple in
the extreme but should illustrate the
inherent "fuzziness" of MPD. While
the computer is essential for
organizing and disseminating MPD, it
is frustratingly "mindless" and rigid
when it must be harnessed to deal with
fuzzy entities. It is for this reason
that so much effort has been expended
in learning and agreeing how, in a
computer environment, (1) to specify
engineering materials unambiguously;
(2) to determine what minimum set of
data and metadata are required to
define a given property; (3) to
stipulate the quality of those data;
(4) to convey that succinctly through
uniform vocabulary to users with
disparate backgrounds and interests.

An additional complication that must
be faced by producers of MPD is that
the quality of data needed varies with
the objectives of the user. Thus,
during conceptual design, the data
need only be approximate but should
cover a wide range of material classes
so that all plausible candidates are
considered. During preliminary
design, range of coverage is no longer
a factor but now higher precision and
reliability are. Ultimately, in the
final design stage, the best accuracy
and precision are mandatory. Because
scope of coverage and high accuracy
are essentially independent goals,
each attained only at considerable
cost, compromises must be made.

4. PROGRESS TO DATE

As mentioned in the Introduction, the
recent Third International Symposium
on the Computerisation and Use of
Materials Property Data provides an
excellent overview of the state of
this activity. In reviewing the
proceedings, it is convenient to use
the categories used by the program
organizers.

4.1.  Standards  and  Data  Representation 

Because it is so costly to obtain and
disseminate and so vital to a
proficient manufacturing industry,
materials information must be regarded
as an international commodity.
Harmonization of such basic entities
as terminology, materials description,
data representation and data quality
is the key to sharing that commodity
efficiently and economically.

Terminology, for instance, must
attempt to cover comprehensively and
accurately the vast range of materials
(metals, polymers, refractories,
composites, etc.), must cope with
three different origins of
nomenclature (the data source, the
database producer and the user) and
must admit hierarchical and synonymous
thesaurus relationships enabling a
"mindless" computer to expand,
translate or link terms
appropriately.
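
The thesaurus relationships described above can be sketched in a few
lines of Python. This is a minimal illustration only: the term lists,
the SYNONYMS and NARROWER tables and the expand function are
hypothetical and are not taken from any system presented at the
Symposium.

```python
# Hypothetical thesaurus: every name and term below is illustrative.
SYNONYMS = {
    "aluminium": "aluminum",   # spelling variant mapped to a preferred term
    "plastics": "polymers",
}
NARROWER = {
    "metals": ["aluminum", "steels"],
    "steels": ["stainless steels"],
    "polymers": [],
}

def expand(term):
    """Return the preferred term plus all narrower terms, depth-first."""
    preferred = SYNONYMS.get(term.lower(), term.lower())
    result = [preferred]
    for child in NARROWER.get(preferred, []):
        result.extend(expand(child))
    return result

print(expand("Metals"))
print(expand("Plastics"))
```

A query on a broad term is thereby expanded to its narrower terms,
while a synonym is first translated to the preferred term.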

Lest anyone think that materials
databases could be trivially produced
by scanning printed works and feeding
the results into a computer, it was
shown that this process is anything
but straightforward. A detailed
account of the problems encountered in
capturing the information in a major
reference such as the Military
Handbook SB amply demonstrated the
impracticality of attempting to
convert such data collections to
machine readable form solely by
automatic means.

4.2. Standards and Database
Development

In its relatively brief existence,
Committee E-49 has already succeeded
in producing five ASTM Standards.
Given that such standards are only
adopted after scrutiny and consensus
of progressively wider segments of the
ASTM membership, as a proposed
standard proceeds from the initial
drafting group to the Society level,
this is indicative of the intense
level of activity in this area.

E1407,  the  most  recent  standard, 
relates  to  the  quality  and  reliability 
of  data,  database  system  management, 
system  capabilities  and  data  security. 
The guidelines described are intended
for both producers and distributors of
MPD and are applicable to all delivery
systems whether personal computer
diskettes, CD-ROM disks, magnetic
tapes, local area networks or public
telecommunications networks.

To facilitate exchange of data between
databases and between databases and
applications, it is essential to
develop a mutually agreed upon,
machine independent format in which
the data may be arranged. Without
such neutral exchange formats, anyone
wishing to acquire data from another
source must first convert those data
to a form recognized by his
application. Repeating that exercise
for every new data source is obviously
an inefficient, error-prone procedure;
hence the interest in neutral exchange
formats. To date there exists no
widely accepted format but some
organizations have developed ones for
their own use. It seems likely, in
fact, that a workable format may
emerge as a subset of the formats
being developed in the STEP program.
STEP (Standard for the Exchange of
Product Model Data) is an activity of
the International Standards
Organization aimed at developing a
generic, computer-compatible means of
describing manufactured goods.
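
The economy of a neutral format can be illustrated with a short
sketch. It assumes that each of n systems must exchange data with
every other; all record layouts and function names below are
hypothetical and do not represent STEP or any existing format.

```python
def converters_needed(n_systems):
    """Count one-way converters needed for n systems to exchange data."""
    pairwise = n_systems * (n_systems - 1)   # every system to every other
    via_neutral = 2 * n_systems              # one reader and one writer each
    return pairwise, via_neutral

# Two hypothetical source layouts mapped onto one neutral record layout.
def from_source_a(rec):
    return {"material": rec["name"], "property": "tensile_strength",
            "value": rec["uts_mpa"], "unit": "MPa"}

def from_source_b(rec):
    # Source B is assumed to store strength in ksi; 1 ksi = 6.894757 MPa.
    return {"material": rec["alloy"], "property": "tensile_strength",
            "value": round(rec["uts_ksi"] * 6.894757, 1), "unit": "MPa"}

for n in (2, 5, 10):
    print(n, converters_needed(n))
print(from_source_b({"alloy": "X", "uts_ksi": 100.0}))
```

Under these assumptions the neutral route needs fewer converters
once more than three systems are involved, and each new source
requires only one additional reader.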

4.3. Expert Systems and Materials
Databases

As evidence that the field of MPD is
maturing and interest is expanding to
matters other than the data
themselves, a complete session was
devoted to the role of expert systems.

The user interface is one application
receiving attention. It is proposed
to use expert systems to assist a
novice user in formulating a database
query or to aid the user to appreciate
the limitations of the data retrieved.
Potentially, expert systems offer a
means of handling "fuzzy" data, from
both the query and display points of
view, that cannot afford to be
ignored. On the other hand, in the
inevitable zeal to apply these tools,
one must not neglect their inherent
limitations.

Another application described was the
prediction of behaviour (cracking of
stainless steels in a hostile
environment, in this instance) by
interpreting field data in terms of an
extensive knowledge base of rules and
materials data.

In a dramatic live demonstration, one
group showed the use of expert
systems as a tool for teaching
students to exploit MPD in a creative,
intuitive manner. Rather than bury
the student in reams of tables and
charts, the system encourages him/her
to think in terms of so-called "Merit
Indices", combinations of materials
properties which, if maximized,
maximize performance. Such a system
almost automatically forces a designer
to look at competitive materials,
thereby avoiding the tendency simply
to use what has worked in the past and
possibly leading to more innovative,
optimized products.
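
The idea of ranking candidates by a merit index can be sketched as
follows. The index used here, the square root of Young's modulus
divided by density, is one commonly cited for a light, stiff beam;
it is an assumption for illustration, since the indices actually
demonstrated are not listed here. The candidate materials and their
property values are invented placeholders, not measured data.

```python
# Illustrative values only: (Young's modulus E in GPa, density rho in Mg/m^3).
CANDIDATES = {
    "material A": (200.0, 8.0),
    "material B": (70.0, 2.8),
    "material C": (100.0, 1.6),
}

def merit_index(E, rho):
    """Index for a light, stiff beam: sqrt(E) / rho (larger is better)."""
    return E ** 0.5 / rho

def rank(candidates):
    """Sort candidate names by merit index, best first."""
    return sorted(candidates,
                  key=lambda name: merit_index(*candidates[name]),
                  reverse=True)

print(rank(CANDIDATES))
```

Because every candidate is scored by the same index, a material
outside the designer's usual choice can surface at the top of the
ranking.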

4.4. Data Issues for Engineering
Materials

While, as described above, work in
developing standards for dealing with
engineering materials in computer
systems began only quite recently,
other standards relating to materials
have existed for years. These
standards, having for example to do
with testing, classification,
designation and use, were produced
without the computer in view and are
not necessarily easily adapted to
computerised systems. Furthermore,
they have a certain inertia and status
from years of acceptance;
practitioners will therefore not
readily abandon them in favour of a
new scheme just because it is claimed
to be more amenable to
computerisation. Designation schemes
are a case in point. Existing schemes
tend to be mnemonic, sacrificing
comprehensiveness of detail for ease
of use; designers of computerized
schemes prefer designators to be
comprehensive because an
unintelligible series of characters is
not a problem to a computer. Thus
there is a certain tension between
practical, working standards and
standards ideally suited to
computerization.

Advanced composite materials are being
used to replace the more traditional
aluminum alloys in aircraft, aerospace
and naval applications. These
materials are inherently difficult to
characterize because their property
data are especially sensitive to such
factors as composition, configuration,
processing and test methodology. Data
for composites therefore require
exceptionally careful evaluation
before being added to a database along
with extensive, well-documented
metadata.

Not  all  issues  related  to  MPD  are 
technical.  If  materials  databases  are 
to  be  successful,  they  must  meet  real 
needs  and  their  perceived  value  must 
be  such  that  a  user  is  willing  to  pay 
a  fair  price  to  access  them. 
Conclusions  from  the  prototype  online 
system  (Materials  Database 
Demonstrator  Programme)  of  the 
Commission of European Communities
(CEC) indicated, in fact, that
trialists regarded data as having low
value. It is evident therefore that
potential clients must be educated to
view technical information like MPD as
a valuable resource, created through
financial and intellectual investment.
Directorate General XIII of the CEC is
about to embark on a program to
address these and related concerns.

4.5. Industrial Implications

The integration of materials selection
into the engineering design process,
in an in-house Materials Information
(Database) System, was reported by one
aerospace company as an important
element in providing a competitive
advantage. Until the next
breakthrough in materials technology,
manufacturers remain competitive
chiefly by continually refining what
they have done before. Improvements
in the quality of MPD permit design
margins to be narrowed and
manufacturing cost savings to be
identified. Design and development
times and costs are reduced because
fewer design iterations are required
to attain a given set of objectives.

To enhance the utility of the System,
provision is made to feed back
experience gained in using a given
material and to document the reasoning
behind design choices. With this
accumulation of design experience the
updated database progressively becomes
a more valuable company resource.

A  role  for  MPD  has  also  been  found  in 
structural  integrity  assessment 
programs. Numerous large-scale
engineering complexes, such as steam
generation or petro-chemical plants,
operate under conditions in which the
crucial properties of some materials
diminish with service. Thus there is
a need for databases containing
information on alloys whose properties
are degraded with long periods of
stress and high temperature service.
Incorporating such MPD with residual
life models in an online, non-invasive
monitoring system would give
management real-time analysis of plant
integrity along with an ability to
forecast useful lifetime.

At  the  other  end  of  the  technological 
spectrum,  NASA  is  developing  a 
database  of  fracture  mechanics 
properties  of  materials  for  use  in 
fracture  control  analysis  of  space 
hardware.  In  light  of  the  discussion 
in  the  first  paragraph  in  sub-section 
4.4 above, it is interesting to note
that the database developers found it
expedient to devise a specialized,
"intelligent" (i.e. non-mnemonic and
essentially unintelligible to a human)
identification code for their data.

4.6. Working and Prototype Systems

Tangible  evidence  of  progress  in  MPD 
was  provided  by  demonstrations  of  a 
number  of  commercial  and  prototype 
systems. (A list of participating
organizations is given in the Annexe.)

That all except one of the
demonstrations were microcomputer-
based rather than online to a remote
host is somewhat indicative of the
manner in which this field is
evolving. (In this particular
location, lack of easy access to
telecommunications may have militated
against more online demonstrations but
the general trend is still evident.)
Thus, quite reasonably and logically,
MPD developers are currently
concentrating on relatively
homogeneous subsets of materials,
thereby serving a defined market and
gaining experience without being
engulfed by the complexities of
dealing with a broad range of
materials. The exception to this is
the online Materials Property Data
Network which already covers plastics,
aluminum and steels with plans to
broaden its scope.

5. CONCLUSIONS

It is evident that much has been
accomplished since the Fairfield Glade
meeting in 1982. What was then a
proverbial gleam in the eyes of a few
visionaries has grown into a maturing
sub-field of the informatics industry.
Databases now exist covering an
impressive range of practical
engineering materials; search systems
vary from simple "look-up" to
artificial intelligence-assisted
queries; standards have been developed
to describe engineering properties and
data quality consistently.

Future  work  will  certainly  involve 
further  efforts  to  develop  standards 
for  facilitating  data  exchange  and  to 
extend  data  recording  formats  to 
additional  materials.  Increasingly 
intelligent  and  helpful  "front  ends" 
will  be  perfected  to  assist  users  at 
all levels of expertise. It is highly
probable that online and stand-alone
systems will evolve in parallel, each
addressing its respective market.

Engineering students in the next
century may well be as blasé about
their materials data as the general
public is now about their utilization
of "ordinary" structures and devices.
Thanks to their predecessors, these
students should be free to major in
design innovation rather than in data
pursuit.


ANNEXE

Demonstrations of Computerized
Materials Property Data Systems and
Expert Systems

1. Achilles: Corrosion Expert
System
National Corrosion Advisory
Service
National Physical Laboratory, UK

2. BASF Plastics Materials
Information Systems
BASF plc, UK

3. Computerised Selection of Powder
Materials for Structural
Components
MPR Publishing Services Ltd., UK

4. COMAR: International Databank
for Certified Reference
Materials
Laboratoire National d'Essais,
France & Laboratory of the
Government Chemist, UK

5. Copper Select
Copper Development Association
Inc., USA

6. Engineering Materials Selector
Department of Engineering
University of Cambridge, UK

7. Reader Database & FATDAC
ERA Technology, UK & Failure
Analysis Associates, USA

8. MatDB: Materials Information
System
ASM International, USA

9. MATEDS, Materials Technology
Education System
Royal Institute of Technology,
Sweden

10. Materials Databases from NIST
Standard Reference Data
National Institute of Standards
and Technology, USA

11. Materials Property Data Network
STN International, USA

12. MATUS & Copper Data Disks
Engineering Information Company
Ltd., UK

13. MORPHS: Software for the Natural
Rubber Formulary
Rubber Consultants, UK

14. MTDATA: The NPL Databank for
Metallurgical Thermochemistry
National Physical Laboratory, UK

15. M-Vision - Composites
PDA Engineering, USA

16. PAL: Expert System for Selecting
Industrial Adhesives
Permabond, UK

17. PERITUS Engineering Materials
Database System
MATSEL, UK


FACILITATING THE TRANSFER OF SCIENTIFIC AND TECHNICAL
INFORMATION WITH SCIENTIFIC AND TECHNICAL NUMERIC DATABASES

H.  IlaUer 

Defense Technical Information System
Office of Information Systems and Technology
Cameron  Station 
Alexandria,  VA  22304-6145 
United  States 


SUMMARY 

The  Defense  Technical  Information  Center 
(DTIC)  provides  services  primarily  to  librarians 
and  technical  information  specialists.  In  an  effort 
to  better  serve  engineers  and  scientists,  the  end 
users,  DTIC  conducted  a  technology  assessment  of 
users  and  developers  of  scientific  and  technical 
numeric databases. DTIC's Department of
Defense Gateway Information System (DGIS)
provides  the  access  mechanism  to  databases  and 
the  Multi-Type  Information  and  Data  Analysis 
System  (MIDAS)  will  provide  the  capabilities  to 
process  bibliographic  information  and  numeric 
data. 

I.  INTRODUCTION 

The  Defense  Technical  Information  Center’s 
(DTIC)  mission  includes  facilitating  the  transfer 
of  scientific  and  technical  information  within  the 
Department  of  Defense  (DoD).  Librarians  and 
technical  information  specialists  represent  a  major 
component of DTIC's customer base and often
serve  as  intermediaries  between  DTIC  and 
scientists  and  engineers,  the  end  users  of  the 
information.  DTIC  provides  several  online 
systems  for  their  use  in  providing  technical 
information  to  researchers  within  their 
organizations.  The  first  of  these,  the  Defense 
RDT&E  Online  System  (DROLS)  has  provided 
classified  and  unclassified  access  to  databases 
generated  by  DTIC  since  1974.  The  most  recent 
system,  the  Department  of  Defense  Gateway 
Information  System  (DGIS),  provides  access  to 
over  700  databases  generated  by  government  and 
commercial  entities. 

DGIS 

Dialing into DGIS, users can automatically
connect to and search these remote services,
download the search results onto the DGIS host
computer at DTIC and then process the results
using DGIS analysis and report generating tools
and utilities.

DGIS has provided many searching and processing
functions needed by users of bibliographic
information. These include performing
multi-tasking (i.e. running multiple programs
concurrently), eliminating duplicate citations, and
providing both menu and command driven systems
as chosen by the user. However, there are some
functions that at this time might be better
performed remotely, at the customer's site. These
include retrieving and processing the full text of
a record and marking citations for retrieval.
Desktop workstations provide the capability for
improved user interfaces. Improved user
interfaces (what the user sees) can include
multi-tasking in a windowing environment,
pop-up windows, and hypermedia.
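
Eliminating duplicate citations from results gathered across
several services can be sketched as below. This is an illustration
under stated assumptions, not a description of how DGIS does it: the
record fields, the matching key (normalised title plus year) and
both function names are hypothetical.

```python
import re

def normalise(citation):
    """Build a comparison key: lower-case title without punctuation, plus year."""
    title = re.sub(r"[^a-z0-9 ]", "", citation["title"].lower())
    title = " ".join(title.split())   # collapse repeated spaces
    return (title, citation["year"])

def eliminate_duplicates(citations):
    """Keep the first occurrence of each citation, preserving order."""
    seen = set()
    unique = []
    for c in citations:
        key = normalise(c)
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique

# Hypothetical search results from two services.
results = [
    {"title": "Materials Data Systems", "year": 1990, "source": "service 1"},
    {"title": "materials data systems.", "year": 1990, "source": "service 2"},
    {"title": "Expert Systems for Design", "year": 1989, "source": "service 2"},
]
print(len(eliminate_duplicates(results)))
```

A production system would need a more forgiving key, since the same
report can be cited with differing titles or dates by different
services.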

Presently, the DGIS interface is character-based.
Personal Computer (PC) users with a graphical
user interface rather than a character user
interface are more productive, more accurate, and
less frustrated in their work, regardless of the
individual's microcomputer experience. DGIS'
connectivity to a variety of information services
containing a wide range of subject matter has
implications which go beyond accessing varied
information. With improved interfaces to
facilitate ease of use, the DGIS user base can
expand from the traditional information
intermediary to include the end user, so that the
engineer and scientist can be served directly.

A PC user dialing in to DGIS must run a terminal
emulation program compatible with a UNIX host
to access DTIC's UNIX system. This allows PC users to
continue to run their favorite applications.
However, the customer cannot easily make full
use of the UNIX multi-tasking capabilities, nor
does the user interact in a windowing
environment with a graphical user interface,
because of DGIS' character-based interface.

The  gateway  concept  has  created  a  dichotomy. 
On  the  one  hand,  DGIS  customers  have  dial-in 
capability  over  a  wide  area  network  to  connect  to 
varied  resources  and  use  the  disk  space  and 
processing  tools  at  the  host  computer.  On  the 
other  hand,  they  need  a  graphics  based  interface. 
For  most  of  our  customers  with  their  current 
telecommunications,  connecting  over  a  wide  area 
network  to  a  system  with  a  windowing 
environment  and  a  graphical  user  interface  is 
presently  impractical.  Our  customers  lack  high 
speed  lines  for  intensive  processing  required  to 
send  graphic  instructions.  Until  the  availability  of 
more  rapid  telecommunications,  a  possible 
solution  is  to  offload  some  of  the  centralized 
processing  from  the  host  to  the  remote  system 
and  use  DGIS  to  gateway  to  other  information 
systems. The Multi-type Information and Data
Analysis System (MIDAS) is an attempt to do just
that.

MIDAS 

DTIC  customers  are  a  disparate  group  with  a 
variety  of  computer  environments,  information 
resource  needs,  and  information  processing  needs, 
not  unlike  other  computer  users.  That  is,  they 
use  a  variety  of  hardware  platforms  -  IBM- 
compatible  and  Apple  Macintosh  PCs,  UNIX- 
based  workstations,  minicomputers,  and 
mainframes  with  varying  operating  systems. 
Therefore, DTIC must develop applications that
are interoperable, that is, neither machine nor
operating system dependent. Regardless of the
hardware or operating system used, the
application should have the look and feel with
which the user is most familiar. However, DTIC
cannot incur the expense of copying and
recompiling programs onto different machines.
Programs developed under the X Window
standard, a graphics and network protocol,
eliminate many of these portability issues. This
prototype will show X Windows-based UNIX
programs running on PCs and sharing information
between DOS, Mac and UNIX programs.


Customers' data and information needs vary as
widely as do their computer systems, networking,
and accessing capabilities. Their data resources
vary in media and distribution formats. Besides
bibliographic databases, there are other database
types such as those containing full text and numeric
information. With the influx of multi-media
platforms, audio and image databases will soon
become more prevalent. Also, data and
information usage, and therefore the processing
and analysis tools and utilities required, vary
among intermediaries and end users. So do the
kinds of data and information needed, whether
tabular data, full text of articles, audio or
graphics.

The Multi-type Information and Data Analysis
System (MIDAS) project will respond to these issues.
The MIDAS prototype will demonstrate
DGIS with a graphical user interface; a remote
connection from a local workstation environment
to a gateway of information; interoperability;
access to different database types; and processing
of other types of data. In developing
a MIDAS prototype, we assume
that workstations and PCs (UNIX, DOS and
Mac) are the typical computing environments
within our user community.

The prototype will be developed on a Sun
workstation and will demonstrate connectivity
using the Sun as a server to run applications and
a PC as a terminal. Among new workstation
buyers, the largest group (24%) are
former PC users. The biggest interest in
workstations is from the engineering and technical
community.

Since  DTIC  is  in  the  business  of  transferring 
scientific  and  technical  information  to  the 
Department  of  Defense  (DoD)  community,  the 
prototype will include scientific and technical
numeric databases, primarily materials properties.

Scientific and Technical Numeric Databases

DGIS provides some solutions to the present day
processing of bibliographic information by
connecting to different database types all over the
world. Through DGIS, DTIC can actively assist
end users in accessing and processing not only
bibliographic information, but other types of
information, such as numeric data.

Numeric databases are collections of information
and data. They contain both data and metadata
or textual information relating to the data. There
are many different kinds of numeric databases.
Scientific, technical and engineering databases
comprise the second highest subject category of
all numeric databases next to business databases.

As DTIC's mission is to facilitate the transfer of
scientific and technical information, this is our
primary numeric database subject area of interest.
Within that category, the many types of online
sources include chemical, physical, and materials
properties databases. Materials properties is an
area in which many DoD engineers and scientists
have an interest, as products from aircraft and
missiles to clothing are produced from materials.
Their interests lie in the effect that conditions
such as stress, strain, and temperature have on a
variety of materials. To determine how numeric
data can be processed in MIDAS, DTIC contacted
engineers and scientists.

II. TECHNOLOGY ASSESSMENT-USER
NEEDS RESEARCH

DTIC  sponsored  a  Scientific  and  Technical 
Numeric  Database  Technology  Assessment  to 
identify  individuals  who  have  a  need  for  new 
services  that  DTIC  should  provide;  to  identify 
additional  needs  (resources  and  tools)  these 
individuals have; and to assist in responding to
those needs. The assessment concentrated on
work practices, media, and time spent in
information seeking activities. The study
measured bibliographic information and numeric
data searching frequency, primary sources, user
satisfaction with sources, preferred media,
materials and materials properties interest,
accessibility and computer capabilities and
requirements.

The assessment directly involved end users:
engineers and scientists. Engineers made up 86%
of respondents; the remainder
was evenly split between scientists, technical
information specialists and/or librarians, and
others. DTIC mailed about 20,000 assessment
forms to individuals in the DoD and industry
between October 1989 and January 1990 and
received 1,601 responses. Roughly 73% of those
responding were DoD employees, 22% were from
industry and the remaining 5% were from other
government agencies and universities. Results are
presented below, grouped by major categories.

Work  Practices 

Project management, research, and testing were
the most frequently specified job functions. Most
respondents indicated they rely on paper as the
medium for acquiring and using scientific and
technical information. But a significant minority
use floppy disk. Nearly one-third spend over 25%
of their time in information seeking activities.
However, nearly two-thirds spend the same
amount of time actually using scientific and
technical information. See Appendix for Tables
1 and 2.

Information Needs-Bibliographic Information

1. Frequency of Searching for Bibliographic
Information

Approximately 25% of the respondents frequently
search for bibliographic information (source
citations).

Yes, search: 419 No, do not search: 1,123

2. Primary Sources

Responses to the request to list sources revealed
that most of the respondents rely on five sources
for bibliographic information. These are listed in
the Appendix in Table 3 in order of most used to
least used.

3. Level of Satisfaction with Current Source

The question was designed to measure the level of
satisfaction participants experience with their
current bibliographic sources regarding
accessibility, ease of use, presentation, quality,
user support and price. The majority of responses
indicate that participants rank their data sources
as 'good' to 'needs improvement'. See Table 4 in
the Appendix.

Information Needs-Numeric Data

1. Frequency of Searching for Materials
Properties Numeric Data

Regarding the frequency of searching for numeric
data, just under 50% of those responding search
for data.

Yes,  search  for  data:  751  No,  do  not:  811 

2.  Primary  Sources 

Participants  listed  handbooks  most  frequently  as 
their  primary  source  of  materials  properties 
numeric  data.  Other  sources  are  listed  in  Table 
5  of  the  Appendix. 

3.  Level  of  Satisfaction  with  Current  Source 

The  assessment  was  designed  to  measure  the  level 
of  satisfaction  participants  experience  with  their 
current  data  sources  regarding  accessibility,  ease 
of  use,  presentation  of  data,  quality  of  data,  user 
support  and  price.  The  majority  of  responses 
indicate  that  participants  rank  their  data  sources 
as 'good' to 'needs improvement'. Table 6 in the
Appendix summarizes the results.
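
The proportions quoted in this section can be checked against the
counts reported above with a few lines of Python. The share helper is
introduced here purely for illustration. The yes/no counts for each
question sum to fewer than the 1,601 responses received, presumably
because not every respondent answered every question, so each
percentage is taken over those who answered; the response rate is
approximate because the number of forms mailed is given only as
"about 20,000".

```python
def share(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

response_rate = share(1601, 20000)           # responses per form mailed
bibliographic = share(419, 419 + 1123)       # search for bibliographic information
numeric_data = share(751, 751 + 811)         # search for numeric data

print(response_rate, bibliographic, numeric_data)
```

On these counts, roughly 27% of those answering search for
bibliographic information and about 48% search for numeric data.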

Preferred  Data  Sources  and  Media 

1.  Preferred  Data  Sources 

This  question  tested  six  data  sources  for  their 
importance  to  the  participants  and  the  preferred 
medium for working with each of the sources.
The overall observation is that engineering
handbooks, military handbooks, internal laboratory
reports  and  journals  are  most  important  to  these 
participants  and  that  the  vast  majority  of  the 
respondents  currently  work  in  a  paper  medium. 
See  Table  7  in  the  Appendix. 

2.  Preferred  Media  for  Materials  Properties 

When  asked  to  indicate  media  preference  for 
working  with  materials  properties,  the  respondents 
provided  the  choices  shown  in  Table  8  in  the 
Appendix.  Participants’  preference  for  the  future 
is  to  work  with  information  sources  in  a 
computerized  environment  reducing  their 
dependency  on  paper. 


Defining Materials and Materials Properties Data
Interest and Accessibility

Participants indicated their level of interest and
degree of accessibility in procuring information.
Materials and materials properties which received
the most responses as 'critical' to respondents'
work are listed below.

Materials        Materials Properties

Alloys           Chemical
Metals           Mechanical
                 Physical
                 Thermal

The materials and materials properties that received
a significantly high level of interest but
were considered to be either 'accessible with
difficulty' or 'inaccessible with current resources'
were: Carbon Matrix and Polymer Matrix
Composites, and Optical and Thermoradiative
properties. See the Appendix, Table 9 for the summary
results.

Uses  of  Materials  Properties  Data-Current  and 
Future  Uses 

Responses to this question show that the
respondents expect computer simulation to show the
most significant increase in their use of materials
properties data in the future. The need for this
information in engineering calculation, materials
selection, materials engineering, structural design,
product testing and quality assurance is expected
to decrease in the future.

Computer  Usage 

Most of the participants responding to the
question "Do you use a computer in your work?"
gave a positive response; 1,460 answered yes and
only 114 answered no.

1.  Hardware 

For this group of respondents, the IBM
compatible PC environment received the most
responses. Table 10 in the Appendix lists
hardware in decreasing order of use.

2. Graphics

Most (80.7%) of the participants responding
indicate that they have graphics capabilities, i.e.
graphics cards and monitors.

3.  Modem 

With  regard  to  data  communications  capability 
via  modem,  approximately  45%  of  the  respondents 
have  a  modem. 

4.  Software 

The  types  of  software  functions  were  surveyed  and 
results  are  listed  in  Table  11  of  the  Appendix. 

In  a  separate  question  with  a  free  response 
format,  participants  were  asked  to  list  other 
software  functions  that  are  important  to  them. 
These  include:  analysis  programs,  database, 
graphics,  spreadsheet,  and  word  processing. 

III. INTERVIEWS-FUNCTIONAL
DESCRIPTION

We developed a profile of end users based on the
tabulation of responses for two purposes: to write a
functional description of the MIDAS prototype
and to identify important sources of information and
data required. The profile includes a
representative sample of individuals with the
following attributes:

o  job  functions  include:  projea  management, 
research,  and  testing 

o  acquisition  and  use  of  scientific  and  technical 
information  consumes  25-50%  of  the  workday 

o  a  critical  interest  in  materials  including  metals, 
alloys,  and  composites 

o  an interest in materials properties that are
accessible with difficulty, including mechanical,
physical, thermal, and chemical properties, or inaccessible,
including thermoradiative and optical properties

o  search numeric data and bibliographic literature
regularly

o  diskette is the preferred medium for numeric data
sources 


o  software  function  needs  include  but  are  not 
limited  to:  modeling,  in-house  database,  graphic 
comparison,  statistical  analysis,  unit  conversion, 
CAD/CAM  and  downloading  capabilities. 

Based  on  the  above-mentioned  profile,  we  selected 
individuals in the Washington, D.C. area to
interview  to  obtain  greater  detail  regarding  work 
products,  data  sources  and  needs.  Areas  of 
interest  include:  current  and  future  uses  of 
materials  data,  current  sources  of  materials  data, 
sources  used  in  the  past  and  abandoned,  means  to 
acquire  data,  complaints  about  the  data  sources, 
accuracy  requirements,  and  detailed  work 
environment  descriptions  by  individual  and 
organization  (technology,  computer  environment, 
processes  of  data  input  and  output,  existing 
hardware/software  tools  and  formats,  existing 
programming  facilities  available). 

Interview  Responses 

The engineers and scientists interviewed have
unique data applications and often use different
resources. This aspect of the project is
continuing. However, we have some preliminary
findings regarding resources, computer capabilities,
data usage, and software functions.

1. Resources

Many users must perform literature searches
primarily to determine whether
similar work has been performed in the past or
how certain problems have been solved by other
organizations. They do not conduct their own
searches but rather request their libraries to assist.
Many complain that it takes too long to get
results and that often they must justify why they
need the information because of budgetary cutbacks.

Most offices are composed of engineers and
scientists who either are long-term employees or
have very recently come to work with the
government. Nearly all individuals interviewed
have indicated that they need to share
information with others either inside their own
organization or outside with other agencies and
vendors. Within an organization, there is a
wealth of knowledge to tap. Often these are the
same individuals who have points of contact
outside the organization. Most of the
interviewees rely heavily on communicating with
other experienced, well-established engineers and
scientists for information.

Some individuals feel they cannot afford access
to several desirable online services. Most
engineers interviewed have at least a periodic need
to review military specifications, engineering
handbooks, and test methods endorsed by the
American Society for Testing and Materials
(ASTM) and other standards organizations.

By and large, most work products are not shared
beyond the immediate purpose for which the
information  and  data  were  gathered  and 
generated.  Final  reports  are  often  stored  within 
the  organization  in  hardcopy  with  the  individual 
author  retaining  the  report  on  disk.  A  few 
individuals  do  maintain  a  database  of  information 
so  reports  are  easily  accessible  by  others  for  later 
use. 

2.  Computer  Capabilities 

Most  people  interviewed  are  using  IBM- 
compatible  PCs  that  have  been  upgraded  from  an 
Intel  286  to  the  386  processing  chip.  Within  an 
organization,  most  of  the  individuals  are  either  on 
a  local  area  network  or  are  in  the  process  of 
being  networked.  Many  have  network  access  to 
mini  or  mainframe  computers  on-site.  Usage  for 
such  systems  may  or  may  not  incur  charges. 
Often  there  is  dial-out  capability  from  these 
systems  which  is  rarely  used. 

Many engineers use testing equipment with
embedded computer software programs to
generate tables of data, graphs and electronic
photographs. When compiling information and
data from these systems, there is little data
sharing, which results in re-entry into the PC for
report generation.

The individuals who most frequently use
computers prefer to have their information and
data in some computerized format. The
individuals who prefer paper tend not to be
computer users or are more comfortable with
more traditional scientific and technical practices.
Most of the interviewees do their own
programming. They feel they best know their
own needs and that they are most capable of
developing a useful system.

Many use off-the-shelf software packages for
putting together their final products. Their
products consist of analysis, evaluation and
recommendation reports. Most individuals
interviewed felt they lacked the time to learn new
packages and preferred simple, easy-to-learn
software. They cannot afford a long learning
curve. Most frequently they cut and paste graphs,
tables and text accessible from different computer
systems and even photocopy pages from books.

3.  Data  Usage 

Many  of  these  individuals  generate  a  lot  of  their 
own  data,  primarily  in  testing  and  research.  Some 
interviewees are involved in the actual design,
materials  selection  and  definition  of  system 
specifications.  Others  identify  problems  with 
components  and  verify  that  the  components  met 
original  specifications.  Typical  data  usage 
includes  stress-strain  analysis,  failure  analysis, 
vibrations  analysis,  environmental  testing, 
prototype  testing,  and  testing  of  developed, 
operational  systems.  Some  interviewees  look  for 
alternative  materials,  some  have  to  identify  the 
materials, and for some, the composition is
irrelevant.  Most  are  limited  to  specific  materials 
of  interest  but  many  feel  they  must  keep  abreast 
of  other  new  materials  and  technologies. 

4.  Software  Functions 

Most interviewees use many different software
packages for gathering information and data for
presentation purposes. However, Harvard
Graphics was the most frequently mentioned
package. Many individuals expressed a desire for
unit of measurement conversion. Most of those
interviewed still report results in English and not
in metric units, and conversion is either done
manually or through some relatively simple
software programs. The most frequent desire for
modeling data was finite element analysis. There are
certain favorite models that these individuals use
and the most significant problem is in tracking
data generated from the models. Some
individuals only perform modeling with no testing
on site. Most expressed a desire to have graphic
comparison capabilities in either two or three
dimensions. Often they have to re-enter data into
these packages. Some interviewees have in-house
databases, but others, who do not, want simple
systems for tracking final reports and querying
specific materials data contained in those reports.
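
The "simple system for tracking final reports and querying specific materials data" that interviewees asked for can be illustrated with a small relational store. The following is only a minimal sketch in Python using the standard sqlite3 module; the table layout, the function names and the sample rows are assumptions made for illustration and are not part of the MIDAS design or of any interviewee's system.

```python
import sqlite3


def create_report_db(path=":memory:"):
    """Create a tiny report-tracking database (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE reports ("
        "id INTEGER PRIMARY KEY, title TEXT, author TEXT, year INTEGER)"
    )
    conn.execute(
        "CREATE TABLE properties ("
        "report_id INTEGER REFERENCES reports(id), "
        "material TEXT, property TEXT, value REAL, unit TEXT)"
    )
    return conn


def add_report(conn, title, author, year, measurements):
    """Insert one report plus its (material, property, value, unit) rows."""
    cur = conn.execute(
        "INSERT INTO reports (title, author, year) VALUES (?, ?, ?)",
        (title, author, year),
    )
    report_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO properties VALUES (?, ?, ?, ?, ?)",
        [(report_id, m, p, v, u) for (m, p, v, u) in measurements],
    )
    conn.commit()
    return report_id


def find_property(conn, material, prop):
    """Return (title, value, unit) for every report holding the requested datum."""
    rows = conn.execute(
        "SELECT r.title, p.value, p.unit FROM properties p "
        "JOIN reports r ON r.id = p.report_id "
        "WHERE p.material = ? AND p.property = ? ORDER BY r.id",
        (material, prop),
    )
    return rows.fetchall()


# Made-up entries for illustration only; none come from the survey or interviews.
conn = create_report_db()
add_report(conn, "Example fatigue study", "A. Engineer", 1990,
           [("example alloy", "tensile strength", 100.0, "ksi")])
add_report(conn, "Example thermal study", "B. Scientist", 1991,
           [("example alloy", "thermal conductivity", 50.0, "W/(m K)")])
hits = find_property(conn, "example alloy", "tensile strength")
print(hits)
```

Keeping the unit next to each value is a deliberate choice in this sketch, since most interviewees report in English units and conversion is a separate step.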

IV. TECHNOLOGY ASSESSMENT - DATABASE
DEVELOPERS RESEARCH

The other phase of the technology
assessment, Database Developers Research, was
designed to identify designers, developers and
distributors of scientific and technical numeric
databases. Using a data call, we collected
information regarding participants, the database
and its content, the producer and the distributor
points of contact, database operations, system
capabilities, and requirements for user access.
The data call encompassed any type of scientific
and technical numeric database and was not
limited to materials properties.

Distribution

The data call was mailed to 650 potential
database designers, providers and distributors
(after initial phone contact) between March and
June 1990. These organizations were identified
from a variety of information resources. The types
of organization represented by these 650
individuals were classified as follows:


DoD                          219
Foreign                      107
Industry and Other Govt      328
Total                        600

Responses

Of the 650 data calls sent, 231 were received for
an overall response rate of 33.5%. The
organization affiliation of the respondents is
shown in the following table.

[Table illegible in the source; surviving values: 231, 78, 78.]

The  databases  described  in  the  responses  fall  into 
the  following  categories: 


Materials Properties Databases        76
Other Sci/Tech Numeric Databases      75
Negative Responses                    80
Total                                231


For materials properties numeric databases, the
commercial sector had the largest number of
responses, 32 (42.1%). Foreign and DoD
responses each accounted for 17 (22.4%) of the
materials properties numeric databases reported.
Civilian government and universities each
accounted for 5 responses (6.7%).
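
The sector shares quoted above follow directly from the response counts. As a minimal sketch (the dictionary and function names are ours, chosen for illustration), the counts for the materials properties databases can be turned into percentages as follows; recomputed values may differ from the printed figures in the last digit because of rounding.

```python
# Response counts for materials properties numeric databases, as quoted in the text.
materials_db_responses = {
    "Commercial": 32,
    "Foreign": 17,
    "DoD": 17,
    "Civilian government": 5,
    "Universities": 5,
}


def shares(counts):
    """Return each category's share of the total, in percent, to one decimal."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


total_materials = sum(materials_db_responses.values())
materials_shares = shares(materials_db_responses)
for name, pct in materials_shares.items():
    print(f"{name:<22}{materials_db_responses[name]:>4}{pct:>7.1f}%")
```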

For other types of scientific or technical numeric
databases, the DoD had the largest number of
responses, 55 (73.3%). The commercial sector
was second, with 12 databases (16.0%). Civilian
government accounted for 4 responses (5.3%),
universities for 3 responses (4.0%) and foreign
organizations for 1 (1.3%).

Data  Call  Updating 

The data call demonstrated that numeric database
design and distribution is a dynamic process.
Developers frequently change addresses, make
significant modifications to database capabilities
or contents, and even declare a database obsolete.
A number of the databases represented in DTIC's
"Directory of Resources", a directory accessible on
DGIS, have been declared obsolete or the
developers have moved on to other activities,
leaving no one responsible for the database.
There is value in identifying even databases that
are obsolete. Knowing that data were collected
and are perhaps accessible could provide the end user
with a potential resource.

Of the 151 scientific or technical numeric
databases represented in the responses, 33
(21.9%) are also represented in DTIC's existing
Directory of Resources. The remainder, 118
(78.1%), are databases new to DTIC's directory.
To maintain a current and accurate directory, the
frequency of change demands an update cycle of
at least once a year.
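
A yearly update cycle amounts to reconciling new data call responses against the existing directory. The sketch below shows one way this comparison could be expressed; matching on a normalized database name is an assumption made for illustration (the actual matching procedure is not described here), and the function names are hypothetical.

```python
def reconcile(existing_names, response_names):
    """Split responses into those already listed and those new to the directory.

    Matching is by case-insensitive, whitespace-trimmed name, which is an
    illustrative assumption rather than the procedure used for the data call.
    """
    known = {name.strip().lower() for name in existing_names}
    already_listed = sorted(n for n in response_names if n.strip().lower() in known)
    new_entries = sorted(n for n in response_names if n.strip().lower() not in known)
    return already_listed, new_entries


def percent(part, whole):
    """Share of `part` in `whole`, in percent, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Made-up names for illustration only.
listed, new = reconcile(["Alpha DB", "Beta DB"], ["alpha db", "Gamma DB"])
print(listed, new)

# The counts reported above: 33 of 151 already listed, 118 of 151 new.
print(percent(33, 151), percent(118, 151))
```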


Range of Databases

The group of scientific and technical databases
identified covers a broad range of subject matter,
from medical personnel records to oceanic
temperature gradients. Individually, they cover a
specific area of interest designed for a particular
project or select group of users. Often, the
database designers, lacking funds or the mission,
have not identified potential users beyond those
for whom the database was specifically designed.

Database Media and DGIS

Though online access or distribution for databases
in general is the most prevalent medium, for this
particular collection of scientific and technical
numeric databases, the most frequently used
medium is magnetic disk.

DGIS, with its gateway capabilities to directly
access a wide range of databases, would be of
great value, offering many more users access to
specialized data. However, this could require a
significant investment, depending on the number
of resources to make available and on the
medium, especially if it is not accessible to users
online. The extent of the investment cannot be
estimated without information regarding which
databases to make available.

Directory  on  Diskette 

An effective approach in selecting databases to
distribute or make accessible would identify
databases that address the largest and most
critical segments of the user community. Based
on the critical sources, materials, and materials
properties needs identified in the needs
assessment, a more detailed evaluation of data
call resources would identify those that are
invaluable to the engineer and scientist and any
gaps between data needs and the resources
available.

As a result of this phase of the assessment, a
prototype directory on diskette was developed.
The purpose was twofold: first, to determine
how useful the databases developed are to the
end user; and second, to distribute to the user the
data call information collected in an easy-to-use
product. The data call responses were input to
a dBASE IV database. Using the program
Clipper, we created an executable program, that
is, a stand-alone executable file that can be
invoked directly from DOS on a PC. Therefore,
the directory does not require any additional
software to operate. The resulting product is a
menu-driven system with pop-up windows. A user
can query, display and print results from the
database directory.
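
The query and display steps of such a directory can be sketched in a few lines. The example below is written in Python rather than dBASE IV and Clipper, and it is not the prototype's code; the record fields, function names and sample entries are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    """One directory record; the fields are a hypothetical subset of the data call."""
    name: str
    subject: str
    producer: str
    medium: str


def query(entries, keyword=None, medium=None):
    """Return entries whose name or subject contains `keyword` and whose
    medium equals `medium`; a filter left as None is not applied."""
    results = []
    for entry in entries:
        text = (entry.name + " " + entry.subject).lower()
        if keyword and keyword.lower() not in text:
            continue
        if medium and entry.medium.lower() != medium.lower():
            continue
        results.append(entry)
    return results


def display(entries):
    """Format entries as fixed-width lines suitable for screen or printer."""
    return "\n".join(
        f"{e.name:<30}{e.medium:<16}{e.producer}" for e in entries
    )


# Made-up entries for illustration only; none come from the data call.
SAMPLE = [
    DirectoryEntry("Example Alloy Properties", "materials properties",
                   "Example Lab A", "magnetic disk"),
    DirectoryEntry("Example Ocean Temperatures", "oceanography",
                   "Example Lab B", "online"),
    DirectoryEntry("Example Composite Data", "materials properties",
                   "Example Lab C", "online"),
]

found = query(SAMPLE, keyword="materials", medium="online")
print(display(found))
```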

Individual engineers and scientists who were
interviewed for the numeric processing tools are
also beta testing the diskette directory. The
information we obtain from them will include:
usefulness of directories of resources and the
media; usefulness of the resources identified and
others that should be included; the completeness
of the information; the directory's ease of use;
and any other comments. This aspect of the
project is continuing.

Numeric Processing

As discussed earlier, one of the goals of this
project is to investigate how scientists and
engineers use scientific and technical data to help
us design tools for manipulating data. The
responses to the data call show that approximately
37 percent of the databases (56 of 151) provide
data manipulation or analysis capabilities. The
most common are plotting or other types of
graphical output and unit conversion.
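
Unit conversion, one of the two most common capabilities, is simple to illustrate. The sketch below converts a few English units to SI; the table of units and the function names are assumptions chosen for illustration and do not describe any database identified in the data call. The inch, foot and pound factors are exact by definition, while the pressure factors are rounded.

```python
# Multiplicative factors from English units to SI units.
TO_SI = {
    "in": ("m", 0.0254),            # exact
    "ft": ("m", 0.3048),            # exact
    "lb": ("kg", 0.45359237),       # exact (avoirdupois pound, mass)
    "psi": ("Pa", 6894.757293168),  # rounded
    "ksi": ("Pa", 6894757.293168),  # rounded
}


def to_si(value, unit):
    """Convert `value` in an English `unit` to (SI value, SI unit name)."""
    si_unit, factor = TO_SI[unit]
    return value * factor, si_unit


def fahrenheit_to_celsius(deg_f):
    """Temperature needs an offset as well as a scale factor."""
    return (deg_f - 32.0) * 5.0 / 9.0


length_m, length_unit = to_si(12.0, "in")
stress_pa, stress_unit = to_si(10.0, "ksi")
print(length_m, length_unit)
print(stress_pa / 1e6, "MPa")
print(fahrenheit_to_celsius(212.0), "deg C")
```

Temperature is handled by a separate function because a single multiplicative factor cannot express the 32-degree offset.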

V.  CONCLUSION 

DTIC's gateway system, DGIS, will become the
vehicle by which a wide range of information and
data services will be accessed by intermediaries
and end users. MIDAS will provide a complete
processing environment for both bibliographic
information and numeric data. Based on the
Scientific and Technical Numeric Database
Technology Assessment, DTIC has a much more
thorough understanding of engineers' and scientists'
information and data resource needs and their
computing environment. Also, we have identified
scientific and technical numeric databases
throughout government and industry. Based on
detailed interviews with a select, representative
sample of those needs assessment respondents,
users of numeric data want and need many
capabilities that MIDAS could provide. We are
writing a functional description to provide the
basis of the design of software functions. Also,
we are obtaining comments from the community
regarding the usefulness of databases contained in
the prototype directory.


The MIDAS prototype is a first step towards
handling a variety of information and data in
varying formats. In the future, information
seeking activities will be distributed to an
individual's workstation. DTIC has already
provided the mechanism to access information
sources, DGIS, but the next step, a complete
information and data analysis and processing
environment, has just begun.

APPENDIX


Table 1: Primary Media

Medium          Number of     Percent of
                Responses     Respondents
Paper                964          61.6
Floppy Disk          430          27.5
Online               159          10.2
Other *1              11           0.7
Total              1,564

*1 Others included CD-ROM, magnetic tape, microfiche/microfilm, AutoCAD, CPU, mylar
and verbal.


Table 2: Amount of Time Spent Acquiring and Using Scientific and Technical Information

% of Time     % of      |   % of Time     % of
Acquiring     Resp      |   Using         Resp
1-10%         38.9      |   1-10%         21.4
11-25%        20.3      |   11-25%        19.1
26-50%        22.2      |   26-50%        32.0
51-100%        9.6      |   51-100%       27.5

Table 3: Primary Sources for Bibliographic Information

Source                       Number of
                             Respondents
Libraries                        75
DTIC                             70
Dialog                           56
Journals/open literature         49
Chemical Abstracts               31

Table 4: Level of Satisfaction with Aspects of Bibliographic Information Sources

                                                   Needs                             Total
                     Excellent      Good           Improvement    Unsatisfactory    Response
Accessibility        132  16.2%     347  42.7%     263  32.3%      71   8.7%          813
Ease of Use          105  12.9%     390  47.9%     265  32.6%      54   6.6%          814
Data Presentation     66   8.3%     396  49.9%     279  35.2%      52   6.6%          793
Quality of Data       75   9.3%     424  52.6%     254  31.5%      53   6.6%          806
User Support          67   9.2%     298  40.9%     266  36.5%      98  13.4%          729
Price                 93  13.3%     306  43.8%     220  31.5%      79  11.3%          698

Table 5: Sources of Materials Properties Numeric Data

Sources                                Number of
                                       Respondents
Handbooks (unspecified)                   263
ASM                                       109
Journals/Literature references            100
Military data sheets/specifications        96
Manufacturing catalogues/handbooks         93
Military handbooks                         55
Technical papers/reports or DTIC           48


Table 6: Level of Satisfaction with Aspects of
Numeric Data Sources

                                                   Needs                             Total
                     Excellent      Good           Improvement    Unsatisfactory    Response
Accessibility        156  12.9%     524  43.2%     451  37.1%      83   6.8%         1214
Ease of Use          108   9.0%     494  41.1%     492  41.0%     107   8.9%         1201
Data Presentation     75   6.5%     525  45.4%     478  41.3%      79   6.8%         1157
Quality of Data       99   8.5%     578  49.4%     413  35.3%      80   6.8%         1170
User Support          45   4.1%     357  32.8%     485  44.5%     203  18.6%         1090
Price                126  13.1%     403  42.0%     301  31.4%     129  13.5%          959

Table 7: Importance of Key Materials Properties Data Sources

Source                          Always         Sometimes      Never          Total
Engineering Handbooks           953  65.8%     405  28.0%      90   6.2%     1448
Internal Laboratory Reports     432  32.1%     636  47.2%     279  20.7%     1347
Journals                        417  30.6%     675  49.6%     269  19.8%     1361
Military Handbooks              590  43.9%     509  37.9%     244  18.2%     1343
Product Specifications          347  54.7%     218  34.4%      69  10.9%      634
Technical Reports               275  47.7%     244  42.4%      57   9.9%      576

Table 8: Preferred Media

Medium               Number of     Percent of
                     Responses     Responses
Floppy/Hard Disk        698           30.8
Paper                   638           28.2
On-line                 490           21.7
CD-ROM                  247           10.9
Microfiche/film         151            6.7
Magnetic tape            38            1.7



Table 10: Hardware

Type of Computer          Number of
                          Responses
PC - IBM compatible         1,231
Mainframe with CRT            456
Mini with CRT                 261
PC - Macintosh                242
Graphics Workstation          198


Table 11: Software - Current Status and Needs

Software Function       Currently Have     Need
CAD/CAM                      513            230
Downloading                  483            207
Graphic compare              352            346
Inhouse database             557            355
Modeling                     383            357
Statistical Analysis         463            343
Unit Conversion              195            299

1. Reva Basch, "Database Software for the 1990s and Beyond - Part 1:
The User's Wish List," Online (March 1990): 17-24.

2. Lisa Day-Copeland, "Clinical Research Finds PC Users Are More
Productive with GUI Interface," PC Week, 23 July 1990: 142.

3. Robert D. Hof, "Where Sun Means to Be a Bigger Fireball,"
Business Week, 15 April 1991: 73-74.

4. Martha E. Williams, "The State of Databases Today: 1991;
Forward," Computer-readable Databases: A Directory and Data
Sourcebook, 7th edition.

5. Ibid.